Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Honor non-determinism fail workflow policy #1287

Conversation

taylanisikdemir
Copy link
Member

@taylanisikdemir taylanisikdemir commented Nov 3, 2023

What changed?
Users can specify NonDeterministicWorkflowPolicy in worker options. If the FailWorkflow policy is chosen the workflow is expected to terminate as soon as it ends up with a nondeterministic state (e.g. activity order changed).
However this wasn't honored for a category of nondeterminism cases. This PR addresses it and workflows fail once any nondeterminism scenario is encountered.

There are two categories of nondeterminism cases in terms of how they get detected by client library:

  1. Issue bubbles up as illegal state panic to the task handler. Most actual prod cases.
  2. Issue is caught when comparing replay decisions with history. Replay test scenarios and a subset of prod cases.

FailWorkflow policy was honored for 2 but not for 1.

Why?
To make NonDeterministicWorkflowPolicy feature correct/complete.

How did you test it?
Added an integration test to simulate this scenario.

Potential risks
Users depending on existing buggy behavior can be impacted. This would only happen if and only if all the below holds true

  • user set FailWorkflow policy
  • user expects to not terminate category 1 (see above) workflows

This is not very realistic expectation because users don't know about these subcategories of nondeterminism detection mechanisms.
So the risk of this fix should be minimal.

Copy link
Member

@Groxx Groxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropping some notes from a first skim.

overall very close I think, I just need another read in an IDE or something to make sure it's good to go since this area in general is a little bit risky (but totally worth it)

internal/internal_task_handlers.go Show resolved Hide resolved
internal/internal_task_handlers.go Outdated Show resolved Hide resolved
internal/internal_task_handlers_test.go Outdated Show resolved Hide resolved
test/activity_test.go Show resolved Hide resolved
test/integration_test.go Outdated Show resolved Hide resolved
internal/internal_task_handlers.go Outdated Show resolved Hide resolved
test/integration_test.go Show resolved Hide resolved
test/integration_test.go Outdated Show resolved Hide resolved
test/integration_test.go Outdated Show resolved Hide resolved
test/workflow_test.go Outdated Show resolved Hide resolved
test/workflow_test.go Show resolved Hide resolved
test/integration_test.go Show resolved Hide resolved
test/workflow_test.go Outdated Show resolved Hide resolved
test/workflow_test.go Outdated Show resolved Hide resolved
Copy link
Member

@Groxx Groxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure there's a fairly-prominent thing in the changelog (this PR or another, as long as it happens before release), but yep. LGTM, thanks for tracking down the details on this :)

@taylanisikdemir taylanisikdemir force-pushed the taylan/nondeterminism_fail_policy branch from b2f959b to 0542960 Compare November 13, 2023 22:49
@taylanisikdemir taylanisikdemir merged commit 4cf8503 into cadence-workflow:master Nov 13, 2023
@taylanisikdemir taylanisikdemir deleted the taylan/nondeterminism_fail_policy branch November 13, 2023 23:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants