fix: fixes race conditions in lock acquisition in PG backend #108
Conversation
@microsoft-github-policy-service agree company="Mosaic"
Thanks for this PR! I think your changes are doing what the original code was meaning to. The original query, which contained both an […]. In any case, if your current PR is fixing the issue you're seeing, then I'm more than happy to merge it as-is.
Hi @JonathanFejtek, just checking in to see if you saw my message above.
@JonathanFejtek, your adjustments will be helpful.
Hey @cgillum @mrjmarcelo. My apologies for the radio silence; it's been a busy couple of weeks and I haven't had much time outside of work to commit to this. As far as I can tell, the issue with the current lock acquisition code is just that it's not concurrency-safe at the default Postgres transaction isolation level (READ COMMITTED). Basically, two concurrent transactions can get the same result for the sub-select, which is evaluated before the […]. At a higher transaction isolation level (SERIALIZABLE, for instance, which conveniently is the SQLite default), the existing code would theoretically work without issue, but I haven't tested this yet.
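To make this concrete, here's a minimal sketch of the standard Postgres work-queue idiom that sidesteps the race by taking row locks inside the sub-select with `FOR UPDATE SKIP LOCKED`. The table and column names below are hypothetical stand-ins, not the actual schema or the query this PR changes:

```go
package pgbackend

import (
	"context"
	"database/sql"
)

// Without FOR UPDATE SKIP LOCKED, the sub-select takes no row locks,
// so under READ COMMITTED two concurrent transactions can both pick
// the same row. With it, a concurrent claimant skips rows already
// locked by an in-flight transaction and claims the next unclaimed
// row instead.
const claimQuery = `
UPDATE new_events
SET locked_by = $1
WHERE sequence_number = (
    SELECT sequence_number
    FROM new_events
    WHERE locked_by IS NULL
    ORDER BY sequence_number
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING sequence_number`

// claimOne attempts to claim a single work item for the given worker;
// sql.ErrNoRows means every visible row is already claimed.
func claimOne(ctx context.Context, db *sql.DB, worker string) (int64, error) {
	var seq int64
	err := db.QueryRowContext(ctx, claimQuery, worker).Scan(&seq)
	return seq, err
}
```

Note the trade-off: SKIP LOCKED sacrifices strict FIFO ordering for throughput, since a claimant can jump past a row that another transaction is still holding.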
I'd be happy to merge this as-is if y'all are good with the change. I've also taken the liberty of adding PG testcontainers to this project, which can make controlled testing of specific PG behaviours a bit easier. If I have some time this week, I'll try to whip up a test that covers this particular race condition.
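For reference, here's roughly what a testcontainers-backed PG test scaffold can look like. This is a sketch assuming the `testcontainers-go` postgres module (recent versions expose `postgres.Run`); the database name, credentials, and image tag are illustrative, not necessarily what this PR adds:

```go
package pgbackend_test

import (
	"context"
	"testing"
	"time"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/modules/postgres"
	"github.com/testcontainers/testcontainers-go/wait"
)

// startPostgres spins up a throwaway Postgres container and returns a
// connection string for the backend under test.
func startPostgres(t *testing.T) string {
	t.Helper()
	ctx := context.Background()

	pg, err := postgres.Run(ctx, "postgres:16-alpine",
		postgres.WithDatabase("durabletask"),
		postgres.WithUsername("test"),
		postgres.WithPassword("test"),
		testcontainers.WithWaitStrategy(
			wait.ForLog("database system is ready to accept connections").
				WithOccurrence(2).
				WithStartupTimeout(30*time.Second)),
	)
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { _ = pg.Terminate(ctx) })

	connStr, err := pg.ConnectionString(ctx, "sslmode=disable")
	if err != nil {
		t.Fatal(err)
	}
	return connStr
}
```

The returned connection string can then be handed to the PG backend under test, so the race can be exercised against a real Postgres instance rather than a mock.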
cgillum left a comment
I'm good with this current approach. Thanks for the explanation about the race condition!
Actually, @JonathanFejtek can you add another commit to update the CHANGELOG.md file and list this PR? Something along the lines of: [vNext]
### Fixed
* Fixed race conditions in lock acquisition in PG backend ([#108](https://github.com/microsoft/durabletask-go/pull/108)) - contributed by [@JonathanFejtek](https://github.com/JonathanFejtek)I'd do it myself but then I'd need to find another approver in order for the PR to be merged. |
Review thread on the CHANGELOG.md diff:

> ### Fixed
>
> * Fixed race conditions in lock acquisition in PG backend ([#108](https://github.com/microsoft/durabletask-go/pull/108)) - contributed by [@JonathanFejtek](https://github.com/JonathanFejtek)
@JonathanFejtek the v0.6.0 tag has already been published. Can you add a new section above the v0.6.0 section called [vNext] and add your entry there?
Also, @JonathanFejtek looks like there is a new conflict that needs to be resolved due to changes by @mrjmarcelo.
@JonathanFejtek Thanks for this fix. I am waiting for this change :)
I seem to have found a race condition in the PG backend which leads to the following log:

`orchestration-processor: failed to complete work item: instance 'e8007fa4-dad3-48a4-90d7-1844c5631685' no longer exists or was locked by a different worker`

I've traced the issue to `GetOrchestrationWorkItem` and `GetActivityWorkItem`, which acquire leases on `NewEvents` and `NewTasks`, respectively. It appears that orchestration workers are able to simultaneously acquire the same lock, only to fail later in `CompleteOrchestrationWorkItem`, where it's determined that only one worker has acquired the lock. I'm not sure if this is meant to be an optimistic concurrency control, but the pessimistic 'lock semantics' and the apparent double work being done by workers suggest to me this may be a bug.
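For concreteness, the failure mode looks roughly like the sketch below. The interface is a trimmed-down, hypothetical stand-in for the real backend types; only the two method names come from the code, and their exact signatures are assumptions:

```go
package main

import (
	"context"
	"log"
	"sync"
)

// workItem is a placeholder for the backend's real work-item type.
type workItem interface{}

// workItemBackend is a hypothetical slice of the real backend
// interface, reduced to the two calls involved in the race.
type workItemBackend interface {
	GetOrchestrationWorkItem(ctx context.Context) (workItem, error)
	CompleteOrchestrationWorkItem(ctx context.Context, wi workItem) error
}

// raceWorkers shows the observed behaviour: two workers fetch
// concurrently, both can be handed the same instance, and the loser
// only finds out at completion time.
func raceWorkers(ctx context.Context, be workItemBackend) {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			wi, err := be.GetOrchestrationWorkItem(ctx)
			if err != nil {
				return
			}
			if err := be.CompleteOrchestrationWorkItem(ctx, wi); err != nil {
				// The losing worker fails here with "instance '...' no
				// longer exists or was locked by a different worker".
				log.Printf("worker %d: %v", worker, err)
			}
		}(i)
	}
	wg.Wait()
}

func main() {}
```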
I would very much appreciate a review! I'm just digging into this project so I may be totally off base here, but I figured I'd throw this up and get some opinions at least.
(I also fixed a few lint errors in the PG backend along the way)