This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Fix double execution of Async system tasks when RepairService is enabled #3836

Merged
merged 1 commit into Netflix:main on Nov 12, 2023

Conversation

VerstraeteBert (Contributor) commented Nov 3, 2023

EDIT: there's another race condition somewhere that we haven't been able to pinpoint. The reasoning below still holds, but this change does not fully solve the problem described.

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes (Please run ./gradlew generateLock saveLock to refresh dependencies)
  • WHOSUSING.md
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

Issue Summary:
There's a race condition in the system involving async system tasks and the WorkflowRepairService. For example, when a SUB_WORKFLOW task starts, the WorkflowRepairService sometimes erroneously reinserts the task into the processing queue because it perceives the task as out of sync between the ExecutionDAO and the QueueDAO. The issue stems from the AsyncSystemTaskExecutor updating a task's status only after removing it from the queue, creating a window in which the WorkflowRepairService can wrongly assess the task's state. This leads to duplicate SUB_WORKFLOW/HTTP/… tasks being executed concurrently, which complicates maintaining task idempotency.
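
To make the ordering concrete, here is a minimal sketch of the race. It is not the actual Conductor code: TaskStore and TaskQueue are hypothetical stand-ins for the execution and queue DAOs, and the repair check is simplified to the condition that matters here.

```java
// Hypothetical stand-ins for the execution store and the task queue.
interface TaskStore {
    String getTaskStatus(String taskId);             // e.g. "SCHEDULED", "IN_PROGRESS"
    void updateTaskStatus(String taskId, String status);
}

interface TaskQueue {
    boolean containsMessage(String queueName, String taskId);
    void remove(String queueName, String taskId);
    void push(String queueName, String taskId);
}

class AsyncSystemTaskExecutorSketch {
    private final TaskStore store;
    private final TaskQueue queue;

    AsyncSystemTaskExecutorSketch(TaskStore store, TaskQueue queue) {
        this.store = store;
        this.queue = queue;
    }

    // Pre-fix ordering: the message leaves the queue while the task status is
    // still SCHEDULED, which opens the window the repair service can observe.
    void executeBuggy(String queueName, String taskId) {
        queue.remove(queueName, taskId);                // (1) task no longer in queue
        // --- window: status is still SCHEDULED and the task is not queued ---
        store.updateTaskStatus(taskId, "IN_PROGRESS");  // (2) status catches up too late
    }
}

class WorkflowRepairServiceSketch {
    private final TaskStore store;
    private final TaskQueue queue;

    WorkflowRepairServiceSketch(TaskStore store, TaskQueue queue) {
        this.store = store;
        this.queue = queue;
    }

    // Simplified repair check: a SCHEDULED task that is missing from its queue
    // looks "lost", so it gets re-queued. If this runs inside the window above,
    // the task ends up in the queue twice and is executed a second time.
    void verifyAndRepair(String queueName, String taskId) {
        if ("SCHEDULED".equals(store.getTaskStatus(taskId))
                && !queue.containsMessage(queueName, taskId)) {
            queue.push(queueName, taskId);
        }
    }
}
```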

Proposed Solution:
To resolve the issue, the AsyncSystemTaskExecutor should update a task's status before removing it from the queue. This closes the window in which the WorkflowRepairService can misidentify the task state and prevents unnecessary re-queuing of tasks. One edge case we've considered: if the process crashes after the task status is updated but before the message is removed from the queue, the executor simply removes the stale message the next time it runs, so system correctness is unaffected.
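
A minimal sketch of the proposed ordering, reusing the hypothetical TaskStore and TaskQueue interfaces from the sketch above; it illustrates the idea only, not the actual AsyncSystemTaskExecutor implementation.

```java
class AsyncSystemTaskExecutorFixedSketch {
    private final TaskStore store;
    private final TaskQueue queue;

    AsyncSystemTaskExecutorFixedSketch(TaskStore store, TaskQueue queue) {
        this.store = store;
        this.queue = queue;
    }

    void execute(String queueName, String taskId) {
        // (1) Move the status forward first, so the repair service never sees a
        //     SCHEDULED task that is absent from its queue.
        store.updateTaskStatus(taskId, "IN_PROGRESS");

        // (2) Only then drop the queue message. If the process crashes between
        //     (1) and (2), the stale message is simply removed on the next run,
        //     because the task is already IN_PROGRESS; correctness is unaffected.
        queue.remove(queueName, taskId);
    }
}
```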

Alternatives considered

Making the SubWorkflow and HTTP tasks synchronous, which would put more pressure on the decide loop.

Co-authored-by: Jaim Silva <[email protected]>
@VerstraeteBert VerstraeteBert marked this pull request as draft November 3, 2023 09:55
@VerstraeteBert VerstraeteBert changed the title Fix WorkflowRepairService and AsyncSystemTaskExecutor race condition Fix double execution of Async system tasks when RepairService is enabled Nov 3, 2023
@VerstraeteBert VerstraeteBert marked this pull request as ready for review November 9, 2023 09:14
@v1r3n v1r3n merged commit de2ca10 into Netflix:main Nov 12, 2023
2 checks passed
dcore94 (Contributor) commented Nov 13, 2023

Might this also be solving this?
