Retrying nodes hangs execution after #1669 (issue #1693)

Closed
simster7 opened this issue Oct 18, 2019 · 2 comments · Fixed by #1694

Comments

@simster7
Member

Builds on or after #1669 (30a91ef) hang the execution when retrying nodes.

Consider this example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1
        template: whalesay

  - name: whalesay
    retryStrategy:
      limit: 3
    container:
      image: alpine
      command: ["sh", "-c"]
      args: ["exit 1"]

Here is the example running on 263cb70 (one commit before the offending commit):

Name:                steps-dlr2r
Namespace:           argo
ServiceAccount:      default
Status:              Failed
Message:             child 'steps-dlr2r-165846992' failed
Created:             Fri Oct 18 15:57:50 -0700 (30 seconds ago)
Started:             Fri Oct 18 15:57:50 -0700 (30 seconds ago)
Finished:            Fri Oct 18 15:58:20 -0700 (now)
Duration:            30 seconds

STEP                                PODNAME                 DURATION  MESSAGE
 ✖ steps-dlr2r (hello-hello-hello)                                    child 'steps-dlr2r-165846992' failed
 └---✖ hello1 (whalesay)                                              No more retries left
     ├-✖ hello1(0) (whalesay)       steps-dlr2r-3706177139  8s        failed with exit code 1
     ├-✖ hello1(1) (whalesay)       steps-dlr2r-15351222    6s        failed with exit code 1
     ├-✖ hello1(2) (whalesay)       steps-dlr2r-3907802757  5s        failed with exit code 1
     └-✖ hello1(3) (whalesay)       steps-dlr2r-1559186360  7s        failed with exit code 1
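
The output above shows four attempts, hello1(0) through hello1(3), for `limit: 3`: one initial run plus three retries, after which the node fails with "No more retries left". As a rough sketch of that expected behavior only (this is not Argo's controller code; `runWithRetries` and `alwaysFail` are hypothetical names introduced here for illustration), the loop looks something like this:

```go
package main

import "fmt"

// runWithRetries is a hypothetical helper, not part of Argo.
// It calls attempt until it succeeds or the retry limit is exhausted,
// and returns how many attempts were made and whether one succeeded.
// A limit of 3 allows attempts numbered 0 through 3.
func runWithRetries(limit int, attempt func(n int) bool) (attempts int, ok bool) {
	for n := 0; n <= limit; n++ {
		attempts++
		if attempt(n) {
			return attempts, true
		}
	}
	// Retries exhausted: the caller should mark the node as failed,
	// not leave it running.
	return attempts, false
}

func main() {
	// Mirrors the example workflow: the container always exits 1.
	alwaysFail := func(n int) bool { return false }

	attempts, ok := runWithRetries(3, alwaysFail)
	fmt.Println(attempts, ok) // 4 false
}
```

The point of the sketch is the terminal case: once the limit is reached, the loop ends with a definite failure, which is the behavior seen on 263cb70.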

Here is the same example running on 30a91ef (the offending commit). The workflow reaches the state shown below after the first failed attempt and stays in Running indefinitely, with no further retries, until it is manually stopped:

Name:                steps-b652k
Namespace:           argo
ServiceAccount:      default
Status:              Running
Created:             Fri Oct 18 16:01:13 -0700 (1 minute ago)
Started:             Fri Oct 18 16:01:13 -0700 (1 minute ago)
Duration:            1 minute 0 seconds

STEP                                PODNAME                DURATION  MESSAGE
 ● steps-b652k (hello-hello-hello)
 └---✖ hello1(0) (whalesay)         steps-b652k-489349053  10s       failed with exit code 1

@dtaniwaki ?

@sarabala1979 Unless I'm missing something, this seems like a major bug and should block our release until it is resolved or reverted.

@jessesuen
Member

@sarabala1979 Unless I'm missing something, this seems like a major bug and should block our release until it is resolved or reverted.

I agree

jessesuen added this to the v2.4 milestone Oct 18, 2019
@dtaniwaki
Member

Let me check it. It's very hard to catch regressions like this without e2e tests in CI.
