fix: add timeout for executor signal #13012

fyp711 · 2024-05-06T06:15:11Z

Fixes #13011

Motivation

Modifications

I added a context timeout to prevent the SPDY stream from blocking for a long time.
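A minimal sketch of the idea, using only the standard library: a call that would otherwise block forever is bounded by `context.WithTimeout`. The names `blockingStream` and `signalWithTimeout` are hypothetical stand-ins for illustration, not the functions touched by this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// blockingStream is a hypothetical stand-in for a stream call that never
// returns on its own. It only unblocks once the context is done.
func blockingStream(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// signalWithTimeout bounds the stream call with a deadline so that a hung
// call cannot block its caller indefinitely.
func signalWithTimeout(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // always release the timer, even on early return
	return blockingStream(ctx)
}

func main() {
	err := signalWithTimeout(50 * time.Millisecond)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

The timeout only helps if the underlying call actually observes the context, which is why a context-aware stream method is needed.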

fyp711 · 2024-05-06T06:31:39Z

Could someone take a look and discuss this issue? It blocks all pod cleanup workers, so it is very serious.

Joibel · 2024-05-07T14:25:52Z

workflow/signal/signal.go

@@ -15,6 +17,8 @@ import (
 	"github.com/argoproj/argo-workflows/v3/workflow/common"
 )

+var spdyTimeout = env.LookupEnvDurationOr("SPDY_TIMEOUT", 10*time.Minute)


The name of the environment variable to control this is difficult to 'discover'.

I'd suggest POD_OUTPUT_TIMEOUT or something along those lines. No user cares about the protocol in use.

This will also need adding to the documentation.
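For readers unfamiliar with this pattern, here is a rough standard-library sketch of reading a duration from an environment variable with a default. `lookupDurationOr` and `parseDurationOr` are hypothetical stand-ins, and falling back to the default on an invalid value is an assumption; the project's `env.LookupEnvDurationOr` may handle invalid values differently.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultTimeout mirrors the 10-minute default shown in the diff above.
const defaultTimeout = 10 * time.Minute

// parseDurationOr parses raw as a Go duration string (for example "30s")
// and returns fallback when raw is empty or invalid.
// Assumption: the real helper may treat invalid values differently.
func parseDurationOr(raw string, fallback time.Duration) time.Duration {
	if raw == "" {
		return fallback
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return fallback
	}
	return d
}

// lookupDurationOr is a hypothetical stand-in for env.LookupEnvDurationOr.
func lookupDurationOr(key string, fallback time.Duration) time.Duration {
	return parseDurationOr(os.Getenv(key), fallback)
}

func main() {
	// "SPDY_TIMEOUT" is the name from the diff under review.
	timeout := lookupDurationOr("SPDY_TIMEOUT", defaultTimeout)
	fmt.Println("signal timeout:", timeout)
}
```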

Thanks for your review, I will fix it.

EXECUTOR_SIGNAL_TIMEOUT was what I was thinking.

Although I'm not sure that this needs to be configurable in the first place, or as long as 10min.

It isn't entirely clear what we should do after a timeout error either though, requeueing may not fix it
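One possible way to frame that question, purely as an illustration and not what this PR implements: treat a deadline error as retryable only up to a bounded number of attempts, so a call that keeps hanging is not requeued forever. The `decide` function, the `action` values, and `errSignalTimeout` are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// action is a hypothetical outcome for one signal attempt.
type action string

const (
	actionDone    action = "done"
	actionRequeue action = "requeue"
	actionGiveUp  action = "give up"
)

// errSignalTimeout wraps context.DeadlineExceeded, the way a timed-out
// call might surface to its caller.
var errSignalTimeout = fmt.Errorf("signal container: %w", context.DeadlineExceeded)

// decide requeues on errors, but gives up once a timeout has been seen on
// the last allowed attempt, since requeueing may not fix a hung call.
func decide(err error, attempt, maxAttempts int) action {
	switch {
	case err == nil:
		return actionDone
	case errors.Is(err, context.DeadlineExceeded) && attempt >= maxAttempts:
		return actionGiveUp
	default:
		return actionRequeue
	}
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Printf("attempt %d: %s\n", attempt, decide(errSignalTimeout, attempt, 3))
	}
}
```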

> It isn't entirely clear what we should do after a timeout error either though, requeueing may not fix it

Thanks for your advice. In our environment, we observe that the pod has actually been deleted, but this goroutine still blocks here.

So, in my opinion, we should set a timeout long enough that it should never be hit in normal operation. The original purpose of this method change in client-go was also to make the call cancelable via a timeout.

See kubernetes/kubernetes#103177.

> Thanks for your advice. In our environment, we observe that the pod has actually been deleted, but this goroutine still blocks here.

Interesting, so a kubectl exec is sent to an already terminated Pod and the k8s API Server does not return?
Or the signal via exec works but the request does not complete for some reason?

@fyp711 you didn't answer the above; do you know why it works that way?

> Interesting, so a kubectl exec is sent to an already terminated Pod and the k8s API Server does not return? Or the signal via exec works but the request does not complete for some reason?

@agilgur5 I think the exec was executed successfully, because that pod is already deleted. I think the reason is that the API server does not return, but I don't know why it didn't return. I tried to analyze it, but it was very difficult. The related upstream Kubernetes change is linked below if you want to have a look. Thank you.
kubernetes/kubernetes#103177

Ok, gotcha. Alan and I were trying to understand why this was happening here and in #13011 (comment).

I read the k8s PR, however it similarly doesn't explain why. It seems like possibly a k8s API Server bug or client-go bug and there should really be an upstream issue for this that we can point to while we use this workaround.

In the interim, I've left a comment here noting that it's a workaround and referencing the PR

workflow/signal/signal.go

fyp711 · 2024-05-07T17:01:00Z

@Joibel @agilgur5 Hi, could you review this again? I have fixed the issues mentioned earlier. Thanks!

Joibel · 2024-05-08T08:26:14Z

> @Joibel @agilgur5 Hi, could you review this again? I have fixed the issues mentioned earlier. Thanks!

Could you please answer my question in #13011 about when you think this might be useful?

fyp711 · 2024-05-08T08:33:18Z

> > @Joibel @agilgur5 Hi, could you review this again? I have fixed the issues mentioned earlier. Thanks!
>
> Could you please answer my question in #13011 about when you think this might be useful?

Okay, I missed that message, sorry. I'll reply now.

docs/environment-variables.md

agilgur5 · 2024-05-08T20:38:24Z

workflow/signal/signal.go

> Thanks for your advice. In our environment, we observe that the pod has actually been deleted, but this goroutine still blocks here.

Interesting, so a kubectl exec is sent to an already terminated Pod and the k8s API Server does not return?
Or the signal via exec works but the request does not complete for some reason?

fyp711 · 2024-05-09T06:59:57Z

I updated the code. Thanks for your help.

fyp711 · 2024-05-09T07:18:20Z

All CI checks passed successfully.

Joibel

Thanks, looking good to me, just a minor coding style thing.

I'll let @agilgur5 and you decide on whether this should be configurable. I'm happy with it with the 2-minute configurable timeout but don't have strong opinions either way.

workflow/signal/signal.go

agilgur5 · 2024-05-11T03:11:53Z

workflow/signal/signal.go

@fyp711 you didn't answer the above; do you know why it works that way?

docs/environment-variables.md

Signed-off-by: Yuping Fan <[email protected]>

workflow/signal/signal.go

Signed-off-by: Anton Gilgur <[email protected]>

agilgur5

LGTM. Thanks for reporting this odd bug with k8s and providing the workaround

agilgur5 · 2024-05-27T04:18:20Z

I wasn't able to backport this into v3.5.x as it has a merge conflict b/c it depends on #12858, which itself has a merge conflict b/c it depends on #12847 (which itself depends on other things that weren't backported to v3.5.x)

fyp711 · 2024-06-21T08:35:41Z

Why is the v3.5.x branch different from the main branch?

fyp711 · 2024-06-21T08:39:26Z

Oh, I see, you want to include #12847 in the v3.6.0 release.

agilgur5 · 2024-06-21T23:44:49Z

Correct, the main branch tracks development, which includes new features and breaking changes of any subsequent minor, and #12847 is a breaking change (as it may not work on older k8s)

fyp711 force-pushed the spdy branch from 7ffae92 to ae347d4 Compare May 6, 2024 06:29

Joibel requested changes May 7, 2024

Joibel self-assigned this May 7, 2024

Joibel added the area/executor and area/gc labels May 7, 2024

agilgur5 changed the title ~~fix: add timeout for SPDY executor stream~~ fix: add timeout for executor signal May 7, 2024

fyp711 force-pushed the spdy branch 3 times, most recently from e0bb02e to b283bfd Compare May 7, 2024 16:17

fyp711 requested review from Joibel and agilgur5 May 8, 2024 02:23

Joibel requested changes May 8, 2024

docs/environment-variables.md (outdated)

fyp711 force-pushed the spdy branch 3 times, most recently from 6d1b39e to 00f149e Compare May 8, 2024 12:50

fyp711 requested a review from Joibel May 8, 2024 15:02

agilgur5 self-assigned this May 8, 2024

agilgur5 reviewed May 8, 2024

fyp711 force-pushed the spdy branch from 00f149e to 9e23ba2 Compare May 9, 2024 06:51

fyp711 requested a review from agilgur5 May 9, 2024 07:17

Joibel requested changes May 9, 2024

workflow/signal/signal.go (outdated)

fyp711 force-pushed the spdy branch from 9e23ba2 to bc7525d Compare May 9, 2024 07:28

Joibel approved these changes May 9, 2024

agilgur5 reviewed May 11, 2024

fix: add timeout for executor signal

9efa627

Signed-off-by: Yuping Fan <[email protected]>

fyp711 force-pushed the spdy branch from bc7525d to 9efa627 Compare May 11, 2024 03:20

fyp711 commented May 11, 2024

workflow/signal/signal.go

fyp711 requested a review from agilgur5 May 11, 2024 03:29

agilgur5 added the area/controller label and removed the area/gc label May 16, 2024

agilgur5 reviewed May 16, 2024

workflow/signal/signal.go

add workaround comment with link to upstream

ac57e06

Signed-off-by: Anton Gilgur <[email protected]>

agilgur5 approved these changes May 16, 2024

agilgur5 enabled auto-merge (squash) May 16, 2024 00:53

agilgur5 merged commit adef075 into argoproj:main May 16, 2024
28 checks passed

agilgur5 added this to the v3.5.x patches milestone May 16, 2024

agilgur5 removed this from the v3.5.x patches milestone May 27, 2024

agilgur5 mentioned this pull request May 27, 2024

Release v3.5 patch releases discussion #11997

Open

agilgur5 added this to the v3.6.0 milestone Jun 21, 2024

agilgur5 mentioned this pull request Sep 21, 2024

signaled container: wait error: unable to upgrade connection: container not found. node Succeeded but wf not progressing #13627

Open

4 tasks
