Skip to content

Conversation

@davidmrdavid
Copy link
Collaborator

Problem statement 🙀

The DF Python SDK's main goal is to rehydrate an orchestrator's state by processing its internal history log. Previously, our algorithm for processing an orchestration's history was quite naive: each DF API invocation could incur on a full scan of the history log; this yielded a quadratic runtime complexity algorithm.

This PR's main changes 😎

Among other things, this PR replaces this algorithm with a more efficient one. The new approach does a linear scan of the history irrespective of the number of DF API invocations. The approach borrows heavily from the TaskOrchestrationExecutor in durableTask but it also takes a few organizational liberties to maximize code reuse.

This PR represents an algorithmic change to the core of the DF Python SDK, and so it touches almost all code-paths that would get invoked during day-to-day usage of the library. As a result, this PR demands significant manual testing before merging to prevent regressions.

Outline of changes

Task-management classes

A major change of this PR is to provide a state-machine implementation of Tasks. Here we implement TaskBase, which is a base class for all Tasks and provides shared logic for all its subclasses. Similarly, CompoundTask is a child of TaskBase but a base class for all compound tasks such as WhenAll and WhenAny.

A particular notable Task abstraction introduced is the RetryAbleTask. This class is meant to encapsulate the logic for determining if a DF API with the -WithRetry suffix is done executing. Introducing this abstraction was necessary for two reasons: (1) it kept the task-management code simple as it allowed us to have 1 task per every DF API and (2) it allowed us to piggyback on the shared Task logic by framing the of scheduling timers and retries of the original task as "sub-tasks" of the RetryAble task.

The new replay driver

The next big change in this PR is the introduction of the TaskOrchestrationExecutor class, which takes care of interleaving the linear pass through the orchestration history with the resumption of the orchestration generator function. This class was written with code-reuse in mind, so it introduces many shared utilities that abstract over common task-creating and task-updating procedures that occur in response to specific history events.

ToDos and other FYIs

As an FYI, I had to remove two very small test files which tested the behavior of utilities that no longer exist. I don't feel great about removing tests, so I'll look to re-implement a version of these tests that targets the same behaviors before merging.

In addition to re-implementing some tests, we should consider doing a lot of manual testing before merging. A first step will be to test that all samples continue working. After that, we can do a bug-bash of sorts to test further edge-cases.

Finally, this PR wouldn't be complete without a full set of comparative benchmark results of this new algorithm against the current approach. Before merge, I'll be posting those results for us to understand the performance impact of these changes.

I'm really excited about this change! Thanks y'all! ✨✨

@davidmrdavid davidmrdavid added the Performance About the performance (speed, memory usage, etc) of the SDK label Jul 1, 2021
Copy link

@ConnorMcMahon ConnorMcMahon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super excited to see this. Left some initial comments on areas to clear up the code from a reader's perspective, but this does seem substantially cleaner than our old implementation.

Copy link

@ConnorMcMahon ConnorMcMahon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though there is a rename that looks like we still haven't done quite yet.

@@ -0,0 +1,314 @@
from azure.durable_functions.models.RetryOptions import RetryOptions

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we still haven't resolved this task.

@davidmrdavid
Copy link
Collaborator Author

My latest commits includes changes for event_sent and raise_event processing :)

Copy link

@ConnorMcMahon ConnorMcMahon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance About the performance (speed, memory usage, etc) of the SDK

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants