workflow stuck in running state when all steps failed #965

Closed
DmitryBe opened this issue Aug 23, 2018 · 4 comments

DmitryBe commented Aug 23, 2018

BUG

workflow stuck in running state when all steps failed

Workflow configuration:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-steps-
spec:
  entrypoint: retry-with-steps
  templates:
  - name: retry-with-steps
    steps:
    - - name: hello2a
        template: random-fail
      - name: hello2b
        template: random-fail
  - name: random-fail
    retryStrategy:
      limit: 5
    container:
      image: python:alpine3.6
      command: [python, -c]
      # always fails: every candidate exit code is 1, so each step exhausts its retries
      args: ["import random; import sys; exit_code = random.choice([1, 1, 1]); sys.exit(exit_code)"]

Details:

argo -n playground get retry-with-steps-hp5gs
Name:                retry-with-steps-hp5gs
Namespace:           playground
ServiceAccount:      default
Status:              Running
Created:             Thu Aug 23 17:36:41 +0800 (6 minutes ago)
Started:             Thu Aug 23 17:36:41 +0800 (6 minutes ago)
Duration:            6 minutes 51 seconds

STEP                       PODNAME                            DURATION  MESSAGE
 ● retry-with-steps-hp5gs
 └-·-✖ hello2a                                                          No more retries left
   | ├-✖ hello2a(0)        retry-with-steps-hp5gs-4203507246  2s        failed with exit code 1
   | ├-✖ hello2a(1)        retry-with-steps-hp5gs-3599365867  2s        failed with exit code 1
   | ├-✖ hello2a(2)        retry-with-steps-hp5gs-1989258896  3s        failed with exit code 1
   | ├-✖ hello2a(3)        retry-with-steps-hp5gs-1653559421  3s        failed with exit code 1
   | ├-✖ hello2a(4)        retry-with-steps-hp5gs-2190884514  3s        failed with exit code 1
   | └-✖ hello2a(5)        retry-with-steps-hp5gs-1586743135  2s        failed with exit code 1
   └-● hello2b
     ├-✖ hello2b(0)        retry-with-steps-hp5gs-3155819765  2s        failed with exit code 1
     ├-✖ hello2b(1)        retry-with-steps-hp5gs-807203368   3s        failed with exit code 1
     ├-✖ hello2b(2)        retry-with-steps-hp5gs-2954194147  2s        failed with exit code 1
     ├-✖ hello2b(3)        retry-with-steps-hp5gs-2484567910  2s        failed with exit code 1
     ├-✖ hello2b(4)        retry-with-steps-hp5gs-2485406553  2s        failed with exit code 1
     └-✖ hello2b(5)        retry-with-steps-hp5gs-3089547932  1s        failed with exit code 1

Controller log:

time="2018-08-23T09:39:04Z" level=info msg="Workflow step group node retry-with-steps-hp5gs[0] (retry-with-steps-hp5gs-371483008) not yet completed" namespace=playground workflow=retry-with-steps-hp5gs

Note:
no pod named retry-with-steps-hp5gs-371483008 was found

What you expected to happen:

Environment:

  • Argo version:
argo version
argo: v2.1.1
  BuildDate: 2018-05-29T20:38:37Z
  GitCommit: ac241c95c13f08e868cd6f5ee32c9ce273e239ff
  GitTreeState: clean
  GitTag: v2.1.1
  GoVersion: go1.9.3
  Compiler: gc
  Platform: darwin/amd64

  • Kubernetes version:
    kubectl version
    Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:07Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

@jessesuen
Member

@DmitryBe - it looks like this was probably addressed in the upcoming v2.2 release. I ran the workflow and it correctly reports Failed:

$ argo get retry-with-steps-zrfw4
Name:                retry-with-steps-zrfw4
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'retry-with-steps-zrfw4-183359371' failed
Created:             Thu Aug 23 02:58:21 -0700 (55 seconds ago)
Started:             Thu Aug 23 02:58:24 -0700 (52 seconds ago)
Finished:            Thu Aug 23 02:59:15 -0700 (1 second ago)
Duration:            51 seconds

STEP                       PODNAME                            DURATION  MESSAGE
 ✖ retry-with-steps-zrfw4                                               child 'retry-with-steps-zrfw4-183359371' failed
 └-·-✖ hello2a                                                          No more retries left
   | ├-✖ hello2a(0)        retry-with-steps-zrfw4-789508230   9s        failed with exit code 1
   | ├-✖ hello2a(1)        retry-with-steps-zrfw4-3406566531  7s        failed with exit code 1
   | ├-✖ hello2a(2)        retry-with-steps-zrfw4-1259575752  6s        failed with exit code 1
   | ├-✖ hello2a(3)        retry-with-steps-zrfw4-1460760085  6s        failed with exit code 1
   | ├-✖ hello2a(4)        retry-with-steps-zrfw4-3608736602  9s        failed with exit code 1
   | └-✖ hello2a(5)        retry-with-steps-zrfw4-3004595223  6s        failed with exit code 1
   └-✖ hello2b                                                          No more retries left
     ├-✖ hello2b(0)        retry-with-steps-zrfw4-294270029   7s        failed with exit code 1
     ├-✖ hello2b(1)        retry-with-steps-zrfw4-1703737120  8s        failed with exit code 1
     ├-✖ hello2b(2)        retry-with-steps-zrfw4-92644411    6s        failed with exit code 1
     ├-✖ hello2b(3)        retry-with-steps-zrfw4-3917985470  6s        failed with exit code 1
     ├-✖ hello2b(4)        retry-with-steps-zrfw4-3918824113  7s        failed with exit code 1
     └-✖ hello2b(5)        retry-with-steps-zrfw4-227998196   6s        failed with exit code 1

@jessesuen
Member

Is this 100% reproducible for you on v2.1.1?

@DmitryBe
Author

It is 100% reproducible.
I also noticed that a configuration with a head step before hello2a/hello2b works fine.
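
The working configuration was not posted in this thread. As a rough sketch only, a variant with a head step might look like the following. The step name `head` and the template `succeed` are hypothetical, added purely for illustration; everything else is copied from the workflow in the report.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-steps-
spec:
  entrypoint: retry-with-steps
  templates:
  - name: retry-with-steps
    steps:
    # hypothetical head step (assumption): runs alone before the parallel group
    - - name: head
        template: succeed
    # parallel group from the original report, unchanged
    - - name: hello2a
        template: random-fail
      - name: hello2b
        template: random-fail
  # hypothetical template (assumption): a container that always exits 0
  - name: succeed
    container:
      image: python:alpine3.6
      command: [python, -c]
      args: ["import sys; sys.exit(0)"]
  - name: random-fail
    retryStrategy:
      limit: 5
    container:
      image: python:alpine3.6
      command: [python, -c]
      # always fails: every candidate exit code is 1
      args: ["import random; import sys; exit_code = random.choice([1, 1, 1]); sys.exit(exit_code)"]
```

The only structural difference from the failing workflow is that the retrying parallel group is no longer the first step group of the entrypoint template. Whether this exact sketch avoids the hang on v2.1.1 has not been verified here.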

@jessesuen
Member

Ok, if it's reproducible on v2.1.1 but not on tip, it appears to be fixed. There were a lot of node metadata fixes that went in that likely addressed this.
