Adopt a linear-time replay algorithm #302
Conversation
ConnorMcMahon
left a comment
Super excited to see this. Left some initial comments on areas to clear up the code from a reader's perspective, but this does seem substantially cleaner than our old implementation.
ConnorMcMahon
left a comment
Looks good, though there is a rename that it looks like we haven't quite done yet.
@@ -0,0 +1,314 @@
from azure.durable_functions.models.RetryOptions import RetryOptions
Looks like we still haven't resolved this task.
My latest commits include changes for event_sent and raise_event processing :)
ConnorMcMahon
left a comment
LGTM!
Problem statement 🙀
The DF Python SDK's main goal is to rehydrate an orchestrator's state by processing its internal history log. Previously, our algorithm for processing an orchestration's history was quite naive: each DF API invocation could incur a full scan of the history log, yielding quadratic runtime complexity.
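To make the cost concrete, here is a hypothetical sketch of that quadratic pattern; it is not the SDK's actual code, and names like `find_completion` and `replay_naive` are made up for illustration:

```python
# Hypothetical sketch of the *old* quadratic behavior (not the SDK's real code):
# every awaited DF API call triggers a fresh scan over the entire history log.

def find_completion(history, task_id):
    """Full linear scan of the history for one task's completion event."""
    for event in history:  # O(n) per lookup
        if event.get("TaskScheduledId") == task_id:
            return event.get("Result")
    return None

def replay_naive(history, scheduled_task_ids):
    # One full scan per awaited task: O(k) lookups * O(n) scan ~ O(n^2) overall.
    return [find_completion(history, tid) for tid in scheduled_task_ids]

history = [{"TaskScheduledId": i, "Result": i * 10} for i in range(5)]
print(replay_naive(history, [0, 2, 4]))  # [0, 20, 40]
```

With many awaited APIs over a long history, these repeated scans dominate replay time, which is exactly what this PR removes.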
This PR's main changes 😎
Among other things, this PR replaces this algorithm with a more efficient one. The new approach does a single linear scan of the history, irrespective of the number of DF API invocations. The approach borrows heavily from the `TaskOrchestrationExecutor` in DurableTask, but it also takes a few organizational liberties to maximize code reuse.

This PR represents an algorithmic change to the core of the DF Python SDK, and so it touches almost all code paths that would get invoked during day-to-day usage of the library. As a result, this PR demands significant manual testing before merging to prevent regressions.
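The core idea can be sketched as follows. This is a simplified illustration under assumed names (`PendingTask`, `replay_linear`), not the SDK's actual implementation: pending tasks are indexed by id once, and each history event is processed exactly once.

```python
# Simplified sketch of the linear-replay idea (hypothetical names, not the
# SDK's actual API): process each history event exactly once, resolving
# pending tasks through an id -> task lookup table instead of re-scanning.

class PendingTask:
    def __init__(self, task_id):
        self.task_id = task_id
        self.result = None
        self.completed = False

    def set_value(self, value):
        self.result = value
        self.completed = True

def replay_linear(history, tasks):
    # Index pending tasks by id once: O(k).
    by_id = {t.task_id: t for t in tasks}
    # Single pass over the history: O(n) total, regardless of task count.
    for event in history:
        task = by_id.get(event.get("TaskScheduledId"))
        if task is not None:
            task.set_value(event.get("Result"))
    return [t.result for t in tasks if t.completed]

history = [{"TaskScheduledId": i, "Result": i * 10} for i in range(5)]
tasks = [PendingTask(0), PendingTask(2), PendingTask(4)]
print(replay_linear(history, tasks))  # [0, 20, 40]
```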
Outline of changes
Task-management classes
A major change of this PR is to provide a state-machine implementation of Tasks. Here we implement `TaskBase`, which is a base class for all Tasks and provides logic shared by all its subclasses. Similarly, `CompoundTask` is a child of `TaskBase` but also a base class for all compound tasks, such as `WhenAll` and `WhenAny`.

A particularly notable Task abstraction introduced is the
`RetryAbleTask`. This class encapsulates the logic for determining whether a DF API with the `-WithRetry` suffix is done executing. Introducing this abstraction was necessary for two reasons: (1) it kept the task-management code simple, as it allowed us to have one Task per DF API, and (2) it allowed us to piggyback on the shared Task logic by framing the scheduling of timers and retries of the original task as "sub-tasks" of the RetryAble task.
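The shape of that hierarchy might look something like the sketch below. The class names mirror this PR's description, but the bodies are simplified guesses for illustration, not the SDK's actual implementation:

```python
# Illustrative sketch of a state-machine Task hierarchy. The names TaskBase
# and CompoundTask come from this PR's description; the bodies here are
# simplified stand-ins, not the SDK's real code.

class TaskState:
    RUNNING, SUCCEEDED, FAILED = range(3)

class TaskBase:
    def __init__(self):
        self.state = TaskState.RUNNING
        self.result = None
        self.parent = None  # a CompoundTask observing this task, if any

    def set_value(self, is_error, value):
        self.state = TaskState.FAILED if is_error else TaskState.SUCCEEDED
        self.result = value
        if self.parent is not None:
            self.parent.handle_completion(self)

class CompoundTask(TaskBase):
    def __init__(self, children):
        super().__init__()
        self.children = children
        self.pending = len(children)
        for child in children:
            child.parent = self

    def handle_completion(self, child):
        raise NotImplementedError

class WhenAllLike(CompoundTask):
    # Completes only once every child task has completed.
    def handle_completion(self, child):
        self.pending -= 1
        if self.pending == 0:
            self.set_value(False, [c.result for c in self.children])

a, b = TaskBase(), TaskBase()
all_task = WhenAllLike([a, b])
a.set_value(False, "A")
b.set_value(False, "B")
print(all_task.state == TaskState.SUCCEEDED, all_task.result)  # True ['A', 'B']
```

A `RetryAbleTask` fits naturally into this scheme: it is a compound task whose `handle_completion` either finishes with the child's result or schedules a timer-plus-retry pair as new sub-tasks.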
The new replay driver

The next big change in this PR is the introduction of the `TaskOrchestrationExecutor` class, which takes care of interleaving the linear pass through the orchestration history with the resumption of the orchestration generator function. This class was written with code reuse in mind, so it introduces many shared utilities that abstract over common task-creating and task-updating procedures that occur in response to specific history events.
ToDos and other FYIs

As an FYI, I had to remove two very small test files which tested the behavior of utilities that no longer exist. I don't feel great about removing tests, so I'll look to re-implement a version of these tests that targets the same behaviors before merging.
In addition to re-implementing some tests, we should do a lot of manual testing before merging. A first step will be to verify that all samples continue working. After that, we can run a bug bash of sorts to exercise further edge cases.
Finally, this PR wouldn't be complete without a full set of comparative benchmark results of this new algorithm against the current approach. Before merge, I'll be posting those results for us to understand the performance impact of these changes.
I'm really excited about this change! Thanks y'all! ✨✨