This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Workflow stuck RUNNING after all tasks complete #3491

Open
sawmonaco opened this issue Feb 14, 2023 · 1 comment
Labels
type: bug bugs/ bug fixes


@sawmonaco

Describe the bug
Hi, occasionally a workflow will have all of its tasks complete, but the workflow itself never reaches COMPLETED. I was only able to reproduce the issue on a Conductor server running with a concurrency > 1 (multiple pods). With a single pod, the issue did not appear to happen.

In the logs at the bottom of this issue, the 4th entry from the bottom sets the workflow to COMPLETED. The 2nd entry from the bottom, written by a different pod, then sets it back to RUNNING.

Details
Conductor version: 3.11.0
Persistence implementation: Postgres
Queue implementation: Postgres
Lock: Redis
Workflow definition: parent workflow & child workflow
Task definition: https://gist.github.com/sawmonaco/84762ba10b7fbf6518b894415c41d055
Event handler definition:
Worker: https://gist.github.com/sawmonaco/72b2f0161a5f0559cc4692b10bc4d0cb#file-conductor_worker-py

To Reproduce
Steps to reproduce the behavior:

  1. Start 300 instances of the parent workflow:
    import requests

    for _ in range(300):
      resp = requests.post("<server>/api/workflow/test_concurrency_parent", json={})
      assert resp.status_code == 200, resp.content
    
  2. Wait until the worker appears to have finished processing everything.
  3. Search for any workflows still in the RUNNING state. I found 2 parent workflows and 2 child workflows (which were unrelated to each other).
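
For step 3, a minimal sketch of how the search for leftover RUNNING workflows could be scripted. It assumes the server exposes `GET /api/workflow/search` with `query` and `size` parameters, that `status IN (RUNNING)` is an accepted query, and that the response carries a `results` list of summaries with a `workflowType` field. None of that is confirmed in this issue, and `search_running` and `count_by_type` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def search_running(server, size=100):
    """Return the workflow summaries that the search API reports as RUNNING.

    Assumption: GET <server>/api/workflow/search accepts `query` and `size`
    and returns a JSON object containing a `results` list.
    """
    params = urllib.parse.urlencode({"query": "status IN (RUNNING)", "size": size})
    with urllib.request.urlopen(f"{server}/api/workflow/search?{params}") as resp:
        return json.load(resp).get("results", [])


def count_by_type(summaries):
    """Count summaries per workflow type, e.g. parent vs. child workflows."""
    return Counter(s.get("workflowType", "unknown") for s in summaries)
```

Calling `count_by_type(search_running("<server>"))` after the worker goes idle would give a per-type count of workflows that are still RUNNING, which is easier to compare across runs than checking by hand.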

Expected behavior
I expect all workflows to finish, i.e. reach COMPLETED.

Screenshots
[screenshot not preserved]

Additional context

Relevant logs:

[
    {
        "@timestamp": "2023-02-14 18:28:10.427",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-2w75r",
        "log": "2023/02/14 18:28:10  -> PUT; 127.0.0.1:42094; /conductor_workflow/_doc/883af765-815d-4a77-a8ab-e22dd5b6fb55?timeout=1m; {\"workflowType\":\"test_concurrency\",\"version\":9,\"workflowId\":\"883af765-815d-4a77-a8ab-e22dd5b6fb55\",\"startTime\":\"2023-02-14T18:28:10.406Z\",\"status\":\"RUNNING\",\"input\":\"{subWorkflowDefinition=null, workflowInput={}, subWorkflowTaskToDomain=null, subWorkflowName=test_concurrency, subWorkflowVersion=9}\",\"output\":\"{}\",\"executionTime\":0,\"failedReferenceTaskNames\":\"\",\"priority\":0,\"failedTaskNames\":[],\"outputSize\":2,\"inputSize\":132}; 201; 0.013s\n"
    },
    {
        "@timestamp": "2023-02-14 18:28:10.705",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-2w75r",
        "log": "2023/02/14 18:28:10  -> PUT; 127.0.0.1:42094; /conductor_workflow/_doc/883af765-815d-4a77-a8ab-e22dd5b6fb55?timeout=1m; {\"workflowType\":\"test_concurrency\",\"version\":9,\"workflowId\":\"883af765-815d-4a77-a8ab-e22dd5b6fb55\",\"startTime\":\"2023-02-14T18:28:10.406Z\",\"status\":\"RUNNING\",\"input\":\"{subWorkflowDefinition=null, workflowInput={}, subWorkflowTaskToDomain=null, subWorkflowName=test_concurrency, subWorkflowVersion=9}\",\"output\":\"{}\",\"executionTime\":0,\"failedReferenceTaskNames\":\"\",\"priority\":0,\"failedTaskNames\":[],\"outputSize\":2,\"inputSize\":132}; 200; 0.012s\n"
    },
    {
        "@timestamp": "2023-02-14 18:31:05.154",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-m295p",
        "log": "2023-02-14 18:31:05.153  INFO 6 --- [nio-8080-exec-9] c.n.c.c.e.WorkflowExecutor               : Pushed workflow 883af765-815d-4a77-a8ab-e22dd5b6fb55 to _deciderQueue for expedited evaluation\n"
    },
    {
        "@timestamp": "2023-02-14 18:31:06.046",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-2w75r",
        "log": "2023-02-14 18:31:06.046  INFO 7 --- [nio-8080-exec-7] c.n.c.c.e.WorkflowExecutor               : Pushed workflow 883af765-815d-4a77-a8ab-e22dd5b6fb55 to _deciderQueue for expedited evaluation\n"
    },
    {
        "@timestamp": "2023-02-14 18:31:06.219",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-2w75r",
        "log": "2023/02/14 18:31:06  -> PUT; 127.0.0.1:53772; /conductor_workflow/_doc/883af765-815d-4a77-a8ab-e22dd5b6fb55?timeout=1m; {\"workflowType\":\"test_concurrency\",\"version\":9,\"workflowId\":\"883af765-815d-4a77-a8ab-e22dd5b6fb55\",\"startTime\":\"2023-02-14T18:28:10.406Z\",\"endTime\":\"2023-02-14T18:31:06.201Z\",\"status\":\"COMPLETED\",\"input\":\"{subWorkflowDefinition=null, workflowInput={}, subWorkflowTaskToDomain=null, subWorkflowName=test_concurrency, subWorkflowVersion=9}\",\"output\":\"{data=}\",\"executionTime\":175795,\"failedReferenceTaskNames\":\"\",\"priority\":0,\"failedTaskNames\":[],\"outputSize\":7,\"inputSize\":132}; 200; 0.012s\n"
    },
    {
        "@timestamp": "2023-02-14 18:31:06.238",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-2w75r",
        "log": "2023-02-14 18:31:06.238  INFO 7 --- [weeper-thread-2] c.n.c.c.e.WorkflowExecutor               : test_concurrency.9/883af765-815d-4a77-a8ab-e22dd5b6fb55 updated parent f992ccc8-e1cc-4ec3-9e07-2611cfdbfc61 task ae1ace3f-e37f-4877-9df4-f7078411a247\n"
    },
    {
        "@timestamp": "2023-02-14 18:31:06.537",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-m295p",
        "log": "2023/02/14 18:31:06  -> PUT; 127.0.0.1:41164; /conductor_workflow/_doc/883af765-815d-4a77-a8ab-e22dd5b6fb55?timeout=1m; {\"workflowType\":\"test_concurrency\",\"version\":9,\"workflowId\":\"883af765-815d-4a77-a8ab-e22dd5b6fb55\",\"startTime\":\"2023-02-14T18:28:10.406Z\",\"status\":\"RUNNING\",\"input\":\"{subWorkflowDefinition=null, workflowInput={}, subWorkflowTaskToDomain=null, subWorkflowName=test_concurrency, subWorkflowVersion=9}\",\"output\":\"{}\",\"executionTime\":0,\"failedReferenceTaskNames\":\"\",\"priority\":0,\"failedTaskNames\":[],\"outputSize\":2,\"inputSize\":132}; 200; 0.012s\n"
    },
    {
        "@timestamp": "2023-02-14 18:31:07.956",
        "kubernetes.pod_name": "conductor-server-5b86d8df4c-m295p",
        "log": "2023/02/14 18:31:07  -> PUT; 127.0.0.1:41164; /conductor_workflow/_doc/f992ccc8-e1cc-4ec3-9e07-2611cfdbfc61?timeout=1m; {\"workflowType\":\"test_concurrency_parent\",\"version\":3,\"workflowId\":\"f992ccc8-e1cc-4ec3-9e07-2611cfdbfc61\",\"startTime\":\"2023-02-14T18:28:09.119Z\",\"endTime\":\"2023-02-14T18:31:07.931Z\",\"status\":\"COMPLETED\",\"input\":\"{}\",\"output\":\"{sub-workflow-1={subWorkflowId=54d36e05-6271-4ab1-bdd6-34a2bad55e7e, data=}, sub-workflow-2={subWorkflowId=883af765-815d-4a77-a8ab-e22dd5b6fb55, data=}}\",\"executionTime\":178812,\"failedReferenceTaskNames\":\"\",\"priority\":0,\"failedTaskNames\":[],\"outputSize\":152,\"inputSize\":2}; 200; 0.018s\n"
    }
]
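
The regression visible in these logs (a COMPLETED index write followed by a RUNNING one from another pod) can be checked mechanically. Below is a minimal sketch, assuming log entries shaped like the JSON array above, with `@timestamp`, `kubernetes.pod_name`, and `log` keys. The helper names `status_writes` and `find_regressions` and the set of terminal statuses are illustrative assumptions. The sketch only inspects the `PUT /conductor_workflow/_doc/...` lines, so it reflects what was written to the index, not necessarily what is stored in Postgres.

```python
import json
import re

# Captures the workflow ID and the JSON document from an index PUT log line.
PUT_RE = re.compile(
    r"PUT; [^;]*; /conductor_workflow/_doc/([0-9a-f-]+)\?[^;]*; (\{.*\}); \d{3};"
)

# Assumed set of terminal workflow statuses (illustrative, not exhaustive).
TERMINAL = {"COMPLETED", "FAILED", "TERMINATED", "TIMED_OUT"}


def status_writes(entries):
    """Yield (timestamp, pod, workflow_id, status) for every index PUT entry."""
    for entry in entries:
        match = PUT_RE.search(entry["log"])
        if match:
            doc = json.loads(match.group(2))
            yield (entry["@timestamp"], entry["kubernetes.pod_name"],
                   match.group(1), doc["status"])


def find_regressions(entries):
    """Return writes that move a workflow from a terminal status back to RUNNING."""
    last = {}  # workflow_id -> (timestamp, pod, status) of the previous write
    regressions = []
    # Timestamps in this format sort chronologically as plain strings.
    for ts, pod, wf_id, status in sorted(status_writes(entries)):
        prev = last.get(wf_id)
        if prev and prev[2] in TERMINAL and status == "RUNNING":
            regressions.append({"workflowId": wf_id, "terminal": prev,
                                "overwrite": (ts, pod, status)})
        last[wf_id] = (ts, pod, status)
    return regressions
```

Run against the entries above, this should flag a single workflow, 883af765-815d-4a77-a8ab-e22dd5b6fb55: COMPLETED written by pod 2w75r at 18:31:06.219, then RUNNING written by pod m295p at 18:31:06.537.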
@sawmonaco sawmonaco added the type: bug bugs/ bug fixes label Feb 14, 2023
@manan164
Contributor

Hi @sawmonaco, thanks for reporting. We will push a fix if required. In the meantime, we can chat here for more real-time collaboration.
