fix: add timeout for executor signal #13012
Conversation
Could someone take a look at this issue? It blocks all pod cleanup workers. It's very serious.
workflow/signal/signal.go (Outdated)
@@ -15,6 +17,8 @@ import (
	"github.com/argoproj/argo-workflows/v3/workflow/common"
)

var spdyTimeout = env.LookupEnvDurationOr("SPDY_TIMEOUT", 10*time.Minute)
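For readers unfamiliar with Argo's env helper, here is a minimal sketch of what the added line does, assuming the semantics of `env.LookupEnvDurationOr` (the real helper lives in the repo's `util/env` package and additionally treats unparsable values as fatal configuration errors; this sketch just falls back):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// lookupEnvDurationOr approximates env.LookupEnvDurationOr: read a Go-style
// duration (e.g. "2m", "30s") from the environment, falling back to a default.
func lookupEnvDurationOr(key string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	os.Setenv("SPDY_TIMEOUT", "2m")
	fmt.Println(lookupEnvDurationOr("SPDY_TIMEOUT", 10*time.Minute)) // prints 2m0s
}
```

With this in place, setting `SPDY_TIMEOUT=2m` on the controller would shorten the timeout from the 10-minute default.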
The name of the environment variable to control this is difficult to 'discover'. I'd suggest POD_OUTPUT_TIMEOUT or something along those lines. No user cares about the protocol in use.
This will also need adding to the documentation.
Thanks for your review, I will fix it.
EXECUTOR_SIGNAL_TIMEOUT was what I was thinking.
Although I'm not sure that this needs to be configurable in the first place, or as long as 10 minutes.
It isn't entirely clear what we should do after a timeout error either, though; requeueing may not fix it.
> It isn't entirely clear what we should do after a timeout error either, though; requeueing may not fix it.

Thanks for your advice. In our environment, we observe that the pod has actually been deleted, but this goroutine still blocks here.
So, in my opinion, we should set a timeout so that this can never happen. The original purpose of this method change in client-go was also to make the call cancelable via a timeout.
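For context, the client-go change referenced here added `StreamWithContext` to the `remotecommand` package, which is what makes such a timeout possible. A minimal sketch of the pattern, assuming a hypothetical wrapper function (`NewSPDYExecutor` and `StreamWithContext` are real client-go APIs; `signalWithTimeout` and the hard-coded timeout are stand-ins for the PR's actual code):

```go
// Package signal is illustrative only.
package signal

import (
	"context"
	"net/url"
	"time"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// spdyTimeout mirrors the PR's env-configurable 10-minute default.
var spdyTimeout = 10 * time.Minute

// signalWithTimeout is a hypothetical stand-in for the exec call in
// workflow/signal/signal.go: the context deadline bounds the whole SPDY
// round trip, so a hung API server response returns an error instead of
// blocking the pod cleanup worker forever.
func signalWithTimeout(cfg *rest.Config, execURL *url.URL, opts remotecommand.StreamOptions) error {
	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", execURL)
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), spdyTimeout)
	defer cancel()
	return exec.StreamWithContext(ctx, opts)
}
```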
> Thanks for your advice. In our environment, we observe that the pod has actually been deleted, but this goroutine still blocks here.

Interesting, so a `kubectl exec` is sent to an already terminated Pod and the k8s API Server does not return? Or the signal via `exec` works but the request does not complete for some reason?
@fyp711 you didn't answer the above; do you know why it works that way?
> Interesting, so a `kubectl exec` is sent to an already terminated Pod and the k8s API Server does not return? Or the signal via `exec` works but the request does not complete for some reason?
@agilgur5 I think the `exec` was executed successfully, because that pod is already deleted. I think the reason is that the API server does not return. But I don't actually know why the API server didn't return; I tried to analyze it, but it was very difficult. The related upstream k8s submission is the link below. You can have a look. Thank you.
kubernetes/kubernetes#103177
Ok, gotcha. Alan and I were trying to understand why this was happening here and in #13011 (comment).
I read the k8s PR; however, it similarly doesn't explain why. It seems like possibly a k8s API Server bug or a `client-go` bug, and there should really be an upstream issue for this that we can point to while we use this workaround.
In the interim, I've left a comment here noting that it's a workaround and referencing the PR.
Force-pushed from e0bb02e to b283bfd
Force-pushed from 6d1b39e to 00f149e
I updated the code, thanks for helping me.
All CI checks passed, very successful.
Thanks, looking good to me; just a minor coding style thing.
I'll let @agilgur5 and you decide on whether this should be configurable. I'm happy with the 2-minute configurable timeout but don't have strong opinions either way.
Signed-off-by: Yuping Fan <[email protected]>
Signed-off-by: Anton Gilgur <[email protected]>
LGTM. Thanks for reporting this odd bug with k8s and providing the workaround.
Why is the v3.5.x branch different from the main branch?
Oh, I see you want to add #12847 into the v3.6.0 release.
Correct, the
Fixes #13011

Motivation

Signaling the executor over SPDY can block indefinitely when the API server never returns, which blocks all pod cleanup workers (#13011).

Modifications

I added a context timeout to prevent SPDY from blocking for a long time.
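A runnable toy sketch of the fix's core idea follows; `blockingSignal` is a hypothetical stand-in that simulates the hung API server response described above, and the 50ms timeout is shortened purely for the demo:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// blockingSignal stands in for the SPDY exec call that can hang when the
// API server never responds.
func blockingSignal(ctx context.Context) error {
	select {
	case <-time.After(time.Hour): // simulates a request that never completes
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the timeout fires
	}
}

func main() {
	// The real code uses spdyTimeout (default 10m); kept short for the demo.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	if err := blockingSignal(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("signal timed out instead of blocking the cleanup worker forever")
	}
}
```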