Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add timeout for executor signal #13012

Merged
merged 2 commits into from
May 16, 2024
Merged

fix: add timeout for executor signal #13012

merged 2 commits into from
May 16, 2024

Conversation

fyp711
Copy link
Contributor

@fyp711 fyp711 commented May 6, 2024

Fixes #13011

Motivation

Modifications

I added a context timeout to prevent spdy from blocking for a long time

@fyp711
Copy link
Contributor Author

fyp711 commented May 6, 2024

Could someone take a discussion for this issue ? It will block all pod cleanup workers. It's very serious.

@@ -15,6 +17,8 @@ import (
"github.com/argoproj/argo-workflows/v3/workflow/common"
)

var spdyTimeout = env.LookupEnvDurationOr("SPDY_TIMEOUT", 10*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the environment variable to control this is difficult to 'discover'.

I'd suggest POD_OUTPUT_TIMEOUT or something along those lines. No user cares about the protocol in use.

This will also need adding to the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review, i will fix it. Thanks

Copy link
Contributor

@agilgur5 agilgur5 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXECUTOR_SIGNAL_TIMEOUT was what I was thinking.

Although I'm not sure that this needs to be configurable in the first place, or as long as 10min.

It isn't entirely clear what we should do after a timeout error either though, requeueing may not fix it

Copy link
Contributor Author

@fyp711 fyp711 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't entirely clear what we should do after a timeout error either though, requeueing may not fix it

Thanks for your advices, in our environment, we observe that the pod has actually been deleted, but this goroutine will still block here.

Copy link
Contributor Author

@fyp711 fyp711 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in my opinion, we should set a timeout that can never happen. The original purpose of this method change in client-go was also to support timeout cancelable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in my opinion, we should set a timeout that can never happen. The original purpose of this method change in client-go was also to support timeout cancelable

see kubernetes/kubernetes#103177

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your advices, in our environment, we observe that the pod has actually been deleted, but this goroutine will still block here.

Interesting, so a kubectl exec is sent to an already terminated Pod and the k8s API Server does not return?
Or the signal via exec works but the request does not complete for some reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fyp711 you didn't answer the above; do you know why it works that way?

Copy link
Contributor Author

@fyp711 fyp711 May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, so a kubectl exec is sent to an already terminated Pod and the k8s API Server does not return? Or the signal via exec works but the request does not complete for some reason?

@agilgur5 I think exec has been executed successfully, because that pod is already deleted. I think the reason is apiserver does not return. But actually, I don't know why apiserver didn't return, I tried to analyze it, but it was very difficult. Further k8s community related submissions are this link below. You can have a look. Thank you.
kubernetes/kubernetes#103177

Copy link
Contributor

@agilgur5 agilgur5 May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, gotcha. Alan and I were trying to understand why this was happening here and in #13011 (comment).

I read the k8s PR, however it similarly doesn't explain why. It seems like possibly a k8s API Server bug or client-go bug and there should really be an upstream issue for this that we can point to while we use this workaround.

In the interim, I've left a comment here noting that it's a workaround and referencing the PR

workflow/signal/signal.go Outdated Show resolved Hide resolved
@Joibel Joibel self-assigned this May 7, 2024
@Joibel Joibel added area/executor area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more labels May 7, 2024
@agilgur5 agilgur5 changed the title fix: add timeout for SPDY executor stream fix: add timeout for executor signal May 7, 2024
@fyp711 fyp711 force-pushed the spdy branch 3 times, most recently from e0bb02e to b283bfd Compare May 7, 2024 16:17
@fyp711
Copy link
Contributor Author

fyp711 commented May 7, 2024

@Joibel @agilgur5 Hi, could you help me to review again? I have already fixed the issues mentioned earlier. Thanks!

@fyp711 fyp711 requested review from Joibel and agilgur5 May 8, 2024 02:23
@Joibel
Copy link
Member

Joibel commented May 8, 2024

@Joibel @agilgur5 Hi, could you help me to review again? I have already fixed the issues mentioned earlier. Thanks!

Could you answer my question in #13011 on when you think this might be useful please.

@fyp711
Copy link
Contributor Author

fyp711 commented May 8, 2024

@Joibel @agilgur5 Hi, could you help me to review again? I have already fixed the issues mentioned earlier. Thanks!

Could you answer my question in #13011 on when you think this might be useful please.

Okay, I missed that message. I'm sorry. Now I'll reply

docs/environment-variables.md Outdated Show resolved Hide resolved
@fyp711 fyp711 force-pushed the spdy branch 3 times, most recently from 6d1b39e to 00f149e Compare May 8, 2024 12:50
@fyp711 fyp711 requested a review from Joibel May 8, 2024 15:02
@agilgur5 agilgur5 self-assigned this May 8, 2024
docs/environment-variables.md Outdated Show resolved Hide resolved
@@ -15,6 +17,8 @@ import (
"github.com/argoproj/argo-workflows/v3/workflow/common"
)

var spdyTimeout = env.LookupEnvDurationOr("SPDY_TIMEOUT", 10*time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your advices, in our environment, we observe that the pod has actually been deleted, but this goroutine will still block here.

Interesting, so a kubectl exec is sent to an already terminated Pod and the k8s API Server does not return?
Or the signal via exec works but the request does not complete for some reason?

@fyp711
Copy link
Contributor Author

fyp711 commented May 9, 2024

I updated the code, thanks for help me.

@fyp711 fyp711 requested a review from agilgur5 May 9, 2024 07:17
@fyp711
Copy link
Contributor Author

fyp711 commented May 9, 2024

All CI's passed, very successful

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looking good to me, just a minor coding style thing.

I'll let @agilgur5 and you decide on whether this should be configurable. I'm happy with it with the 2minute configurable timeout but don't have strong opinions either way.

workflow/signal/signal.go Outdated Show resolved Hide resolved
workflow/signal/signal.go Outdated Show resolved Hide resolved
@@ -15,6 +17,8 @@ import (
"github.com/argoproj/argo-workflows/v3/workflow/common"
)

var spdyTimeout = env.LookupEnvDurationOr("SPDY_TIMEOUT", 10*time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fyp711 you didn't answer the above; do you know why it works that way?

docs/environment-variables.md Outdated Show resolved Hide resolved
@fyp711 fyp711 requested a review from agilgur5 May 11, 2024 03:29
@agilgur5 agilgur5 added area/controller Controller issues, panics and removed area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more labels May 16, 2024
Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for reporting this odd bug with k8s and providing the workaround

@agilgur5 agilgur5 enabled auto-merge (squash) May 16, 2024 00:53
@agilgur5 agilgur5 merged commit adef075 into argoproj:main May 16, 2024
28 checks passed
@agilgur5 agilgur5 added this to the v3.5.x patches milestone May 16, 2024
@agilgur5 agilgur5 removed this from the v3.5.x patches milestone May 27, 2024
@agilgur5
Copy link
Contributor

I wasn't able to backport this into v3.5.x as it has a merge conflict b/c it depends on #12858, which itself has a merge conflict b/c it depends on #12847 (which itself depends on other things that weren't backported to v3.5.x)

@fyp711
Copy link
Contributor Author

fyp711 commented Jun 21, 2024

Why the v3.5.x branch is different from main branch?

@fyp711
Copy link
Contributor Author

fyp711 commented Jun 21, 2024

Oh, i see you want to add #12847 into v3.6.0 Release.

@agilgur5 agilgur5 added this to the v3.6.0 milestone Jun 21, 2024
@agilgur5
Copy link
Contributor

Correct, the main branch tracks development, which includes new features and breaking changes of any subsequent minor, and #12847 is a breaking change (as it may not work on older k8s)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/executor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add timeout for executor signal
3 participants