-
Notifications
You must be signed in to change notification settings - Fork 39
Dapr Workflow Versioning - A Unified Approach to Types and Patches #94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Whit Waldo <[email protected]>
…r#92 and how they're uniquely addressed in this approach Signed-off-by: Whit Waldo <[email protected]>
|
The showstopper for me here is that this doesn't allow multiple versions in the branching approach. A binary flag stops incremental evolution of long running workflows entirely |
|
But that's the point - as the architects enabling simpler distributed application development, we should not be mandating that developers write spaghetti under the guise of flexibility. By adding versions to branches, it's increasingly unclear when a bug has been introduced into a version branch until runtime requiring still another patch version to try cleaning it up. I advocate that patches take a reduced role here - use them to apply small changes to the existing workflow type. Apply more than one as you need to and even wrap patches into other patches (they're In this proposal there's a clear versioning hierarchy - a type version is a final and clean break from the previous workflow implementation because any patches live exclusively in the versioned workflow type they're written in. It's the best route to encouraging clear, intentional code. Patches provide a means to avoid whole type versions for minor changes. They're not intended to be a comprehensive substitute, but serve more as a reduced-level-of-effort stop-gap measure that addresses a small change here, a tweak there and a typo over here. And when their workflow is nothing but a wall of band-aids, why wouldn't we want to encourage refactoring and provide tools to migrate developers to a clean new version? Again, this is somewhere that CLI tooling can (eventually) help with - because the branching is only binary (new vs old code), creating a new version could drop the Do note (per the last bullet point in the FAQ) that unlike in Temporal, long-running workflows can be versioned (to a limited degree) by patch in this proposal and by type without any of the drawbacks of Temporal's approach. With the approach I've proposed here, the two versioning systems are unified in a way that complement each other cleanly - one addresses patches (pure patches, no versioning) and the other addresses clear versions (clean the board of all patches, start fresh). If released separately from one another and we go a pure versioned patch route per #92 , we have told users that spaghetti is acceptable and possibly even desired because we provided no alternative options. Should you ever want to eliminate old branches, now you need to actively incorporate runtime monitoring metrics into your development cycle to understand where you can prune some old paths, but that if you opt into versioning, you'll never get out of a bog of branches. When developers step away for the weekend and come back to their code Monday and think, "Woah, this is a mess," having both concepts in one release showcases that as the Dapr architects, we've proactively thought about this inevitable conclusion and have a solution here and now (not wait for some future release when we get back to it): add a type version, refactor your code as you want, remove all the patches and take the moment to apply any other tweaks you want before repeating with patches in your new workflow going forward. Rinse and repeat. If you don't want to patch your code, you don't have to. Without a clear means of halting patches on patches on patches, I think we introduce significant complexity that would make forward-looking developers think twice about the benefits of incorporating versioning into their workflows at all, so yes, I do think that both should be released as part of the same comprehensive versioning release because it speaks to the whole story. The State of Dapr 2025, page 6, cites that what people like the most about Dapr. The top response with 74% was that it "Reduces microservices complexity", 49% said it "makes operating applications simpler" and 37% said it was easy to get started. The top improvement developers would like to see is "debugging/troubleshooting is hard" and #3 is that they want a "better developer experience". We've all agreed that spaghetti code isn't desirable - why would we set up developers on an inevitable and intractable path towards it? I would urge us not to overthink Workflow Versioning v1. Should users come to us complaining that the combined approach in this proposal is too limited for their purposes, I urge collecting their feedback over why and how this doesn't meet their needs and address those perceived limitations with a more targeted change in the future. What we shouldn't strive for is a complex inline-versioning concept that itself may be too weighty for the average developer to comprehensively understand and thereby preempt any versioning usage in the first place. |
|
@WhitWaldo, while I understand your approach, I still think that mimicking how temporal deal with branches doesn't sounds like a bad idea. Also, as far as my understanding goes with @acroca, you can also achieve this, it is just more flexible for people who need the flexibility. I am a very practical guy, so I would prefer to see #82 move forward quickly so we can try it out and iterate quickly. |
|
@salaboy I guess that's what I see as the benefit to this proposal - by reducing the change on the runtime to a single state value, and simplifying what's proposed in #92 (you +1'ed this one, not #82), I see no reason why they can't easily be implemented at the same time. Patching in this proposal is little more than adding a three-prong decision tree with a method on the workflow context. #92 is a bit more involved. Further, I think my proposal here is far more aligned with how Temporal performs patching than #92 - believe me, I spent quite a bit of time comparing the two while typing this up. Note their use of a named identifier with |
|
I'd like to highlight that Temporal has different ways of implementing patching for the different SDKs. The one you linked seems to be available in .NET, Python and Ruby. If you check their docs about workflow patching, other SDKs have another approach, in particular Go and Java (I didn't check them all) have a |
|
@acroca That's fair - I assumed they took a standard patching approach. My patching style in this proposal is very similar to that implemented in their .NET, TypeScript, and Python SDKs. It appears their Go, Java and PHP implementations approach it more akin to the way you describe in #92. This seems quite odd to me given that I thought their SDK clients were automatically generated. @yaron2 @acroca As of January of this year, tickets have been opened from this issue looking to implement memoization support on all SDKs uniformly and the three SDKs with getVersion support: Java SDK, PHP SDK and Go SDK all note a bullet for "(Maybe) Deprecate the GetVersion API". Does anyone have a contact at Temporal that might be able to shed insight into why they're thinking about dropping it? |
|
I didn't know Temporal is moving away from the if ctx.IsPatched("new_payment_system") {
if ctx.IsPatched("newer_payment_system") {
if err := ctx.CallActivity(NewerPayment).Await(&number); err != nil {
return nil, err
}
} else {
if err := ctx.CallActivity(NewPayment).Await(&number); err != nil {
return nil, err
}
}
} else {
if err := ctx.CallActivity(Payment).Await(&number); err != nil {
return nil, err
}
}I think this would look better with an if/switch statement. I understand the use of patches, but if we have to chose boolean patches or numerical branch versions, I think branch versions are more flexible. Patches could come later, and could even be implemented using branches behind the scenes. |
|
From what I can find across closed issues in their repository and StackOverflow questions, my guess is that they're dropping it because developers certainly get a lot of flexibility from the approach, but they've increasingly used that to produce non-deterministic edge-case implementations and casually wondered why things didn't work. The approach instead appears to be angled towards simplifying the tooling, narrowing the opportunity to introduce unintended errors through increasingly complicated code and largely favor a simplified patching + worker versioning approach that's represented in this proposal. I wholeheartedly agree that the wall of patching looks bad there - it's why my proposal here and proposed documentation would strongly advise users to treat patches as a stop-gap and not an end-solution. That when it's inevitably a mess, favor refactoring (e.g., remove all the patches to restore a single logical path) to a clean file and continue. That's the whole point of #94 is to provide for both in-line patching but also a clear picture of how to "reset" the workflow in a deterministic way and drop all the patches when it gets too cumbersome to manage. Taking your example above, I certainly agree it'd get messy to keep adding a branch. But by shifting to a new type version, tooling can take the if err := ctx.CallActivity(NewerPayment).Await(&number); err != nil {
return nil, err
}The developer is welcome to start adding patches to this new version as needed and when the code is again tricky to follow, version again and repeat. Should the developer want to look at older workflows, the types are present and viewable, but I'd argue it's just not necessary to keep that history at the forefront of every workflow change because the developer is writing what they want to run "next" and just needs a deterministic orchestrator to ensure what previously ran migrates here without error. Temporal's type versioning approach today has some significant downsides that are addressed and dismissed by my proposal here. I've made the case for why it's paramount that both patching and type versioning approaches be released at the same time and why it's only to our benefit for them to be designed in a complementary manner as I've outlined here. |
|
+1 binding After futher discussions, I am ok with voting to this one |
JoshVanL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for consolidating the two strategies, I think it makes sense to go with both.
To move forward, I would like to have the proto definitions of the version and patches authored in the proposal, as well as code examples of usage other than the dotnet SDK.
| Patches themselves are not versioned. They exist within the scope of the versioned workflow type, but they themselves | ||
| do not exist beyond two states. | ||
|
|
||
| ##### Example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It must be the case that SDKs add a check after every workflow reconcile execution to ensure that all patches previously observed on a previous run, are observed in the next run. Consider the following case:
R1 and R2 are the same workflow but running on different replicas.
R1: contains the patch abc
R2: does not contain the patch.
R1 executes and observes the patch abc, and positively enters it.
R2 executes next after a dissemination event, but did not observe the patch abc.
R2 must back off from submitting the work inbox and should either leave the workflow in stasis, or ideally, set a new condition on the workflow status.
After execution, it must be the case that the client checks that all patches previously observed are observed again, and that they were observed in the same order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree on both points.
But that's why the context keeps track of which patches run in the workflow execution and then persist it when the workflow finishes executing is for precisely this point.
Unlike Temporal, I did not intend to add a separate workflow event to indicate going down a patch or not and instead intended it would just be a list maintained on the workflow context. Temporal does it so they're returned alongside any other events as part of a workflow query, but we don't have that sort of tooling or necessity.
I'm certainly open to your take on it - do you think it'd be better to have the patch evaluation occur and register as another of the many workflow events so it's available in the workflow history or the current approach of persisting the truthily evaluated patches on the context and then writing it in the workflow conclusion event? What condition should R2 set on the workflow status to indicate it failed because of a versioning/patching mismatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll need to give the list of patches in the workflow registration.
When a workflow runs in a replica, the SDK will need to check all the enabled patches from the history, and make sure all those patches are part of the workflow registration, as it needs to make sure it knows how to run the specific patched version.
In the example from Josh, if the workflow ran in R1 and enabled the patch abc, this information will be in the state, so R2 will not accept this workflow because abc is not part of the registered patches for this workflow.
So we'll need an array of strings in the RegisterVersionedWorfklow with the list of known patches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm extremely hesitant to do that as that will suddenly make the whole workflow versioning quite brittle from the users perspective and introduce a heck of a footgun if developers neglect to keep the workflow itself and its registration in sync (heaven help them if they deploy an out-of-sync pair). This scenario is precisely one of the reasons I was hesitant about patching support at all.
I'll put some thought into alternative strategies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on this one? I think it's the only missing question we have open atm.
I can see two solutions:
A: We provide all the list of supported patches at the time of workflow registration. With this list, when the SDK processes a workflow instance it'll double check all the enabled patches so far are in the list of supported patches.
The only problem I have with this is that the user will have to remember to keep this list up to date, which makes it very very error prone.
B: In the SDK we can detect a potential missing patch when we encounter a mismatch between the workflow state and the code. When a mismatch is detected, we check if the OrchestratorStarted event at that point has any patch enabled that hasn't been enabled so far by the current version of the workflow. We can assume that a mismatch in code-vs-state in a orchestrator started block that has a patch we haven't seen yet is because the state was generated in a different version of the workflow that had a patch.
In both cases we'll need to put the workflow in a new runtime state called FROZEN (open to naming ideas) that would not run the workflow until the next actor dissemination triggers.
This is because the actor instance for this workflow is assigned to this specific replica, and it will not move until there's an dissemination event, so the workflow will remain FROZEN until then.
So when a dissemination happens, we need to unfreeze all frozen workflows and try again, they might freeze again if they end up in an old replica again, or continue executing if they end up in a new replica.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FROZEN is not a terminal state as the workflow has not completed- and can/will continue in future once a client is able to process it. We should be very specific when we use the word terminal.
We can have a single FROZEN workflow status, with a enum reason describing why it has frozen. This is how we treat other statuses like FAILED. We will probably use FROZEN for some other behaviour in the future.
We should not be creating a event to write the patches to history. It is expensive and the existing history events are capable of holding that data in existing messages.
FROZEN workflows do not require a re-run, and in-fact they cannot be because they are not in a terminal state. They would need to be terminated (made into a terminal state) before a re-run can executed. It also doesn't make sense to rerun a frozen workflow, because definitionally the workflow cannot be serviced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the term PENDING which I believe was a suggestion from @mikeee - and I agree with the enum for reasoning associated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe both the PlacementV2 proposal + the Hard stickiness for actors proposal will help mitigate the wf versioning/patching mismatch issue discussed here - assuming I caught up on the thread correctly.
PlacementV2 reduces dissemination frequency and scope (only affected actor types rebalance), while hard stickiness keeps actors on the same healthy replica longer - together they dramatically lower the chance of an instance landing on an incompatible replica and triggering the PENDING state.
Neither fully eliminates the problem, but they further reduce the risk of this issue occurring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoshVanL The point of adding an event over supplementing the property of an existing event is twofold:
- It opens the door to a re-run that starts on that event (potentially opting for a different patch route as part of the run)
- It allows us to determine the order in which patches were evaluated in a previous run (especially between other workflow replays), giving us better tools to understand if the user broke determinism by re-using the same name in a different place within the same type version.
Using a property on an existing event means option 1 isn't feasible as the re-run starts farther up and we lack the means to pre-set the resulting patch value. Perhaps this is preferred so patches always default to the newer branch even with replay, but it's worth considering one way or another now.
I certainly defer to you on the "expensiveness" of adding another event or not, but I wanted to explicitly call out the advantages of doing so in this edge of edge cases (especially with @cicoyle 's new proposals).
I'm fine reserving "terminal" to mean an entirely dead, cannot be revived workflow. PENDING is certainly fine by me - I'm not picky about the name so long as it's clear that it's something the user has little control over as it requires the runtime to resolve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the proposal here to reflect the mismatch handling concerns using the value PENDING.
The runtime will need a tweak here such that if a workflow currently marked as PENDING returns with a PENDING error again after the expected dissemination event has occurred, it should instead mark the workflow as FAILED because it suggests that the developer has re-used the patch name somewhere else within the same version and that is an unrecoverable failure.
Signed-off-by: Whit Waldo <[email protected]>
Signed-off-by: Whit Waldo <[email protected]>
…h-level presentation and moved into the C# section towards the bottom. Signed-off-by: Whit Waldo <[email protected]>
…ils where I wouldn't normally include them in dapr/proposals Signed-off-by: Whit Waldo <[email protected]>
…updated other prototypes accordingly and cleaned up how it's described in the proposal. Replaces references to "version string" accordingly and removed patch name constraints. Signed-off-by: Whit Waldo <[email protected]>
Signed-off-by: Whit Waldo <[email protected]>
…ript SDK Signed-off-by: Whit Waldo <[email protected]>
|
@JoshVanL I've updated the proposal to include .NET and JS examples and implementation details. |
Signed-off-by: Whit Waldo <[email protected]>
…w events instead of making it part of the request/response to the runtime. Signed-off-by: Whit Waldo <[email protected]>
Signed-off-by: Whit Waldo <[email protected]>
Co-authored-by: Albert Callarisa <[email protected]> Signed-off-by: Whit Waldo <[email protected]>
|
+1 binding |
cicoyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I am mostly good with this proposal combining the strategies of both type based versioning and patching.
I only have a handful of minor comments (mainly around the final workflow state keyword which I believe should be PENDING and some typos), nothing blocking.
With those addressed, I’m happy to give my binding approval.
| Patches themselves are not versioned. They exist within the scope of the versioned workflow type, but they themselves | ||
| do not exist beyond two states. | ||
|
|
||
| ##### Example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the term PENDING which I believe was a suggestion from @mikeee - and I agree with the enum for reasoning associated.
| Patches themselves are not versioned. They exist within the scope of the versioned workflow type, but they themselves | ||
| do not exist beyond two states. | ||
|
|
||
| ##### Example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe both the PlacementV2 proposal + the Hard stickiness for actors proposal will help mitigate the wf versioning/patching mismatch issue discussed here - assuming I caught up on the thread correctly.
PlacementV2 reduces dissemination frequency and scope (only affected actor types rebalance), while hard stickiness keeps actors on the same healthy replica longer - together they dramatically lower the chance of an instance landing on an incompatible replica and triggering the PENDING state.
Neither fully eliminates the problem, but they further reduce the risk of this issue occurring.
Co-authored-by: Cassie Coyle <[email protected]> Signed-off-by: Whit Waldo <[email protected]>
Co-authored-by: Cassie Coyle <[email protected]> Signed-off-by: Whit Waldo <[email protected]>
Co-authored-by: Cassie Coyle <[email protected]> Signed-off-by: Whit Waldo <[email protected]>
Co-authored-by: Albert Callarisa <[email protected]> Signed-off-by: Whit Waldo <[email protected]>
Signed-off-by: Whit Waldo <[email protected]>
…low-versioning2 # Conflicts: # 20251028-BRS-workflow-versioning.md
…mpletedEvent` Signed-off-by: Whit Waldo <[email protected]>
…mismatching values across `OchestratorStartedEvent` messages Signed-off-by: Whit Waldo <[email protected]>
Co-authored-by: Josh van Leeuwen <[email protected]> Signed-off-by: Whit Waldo <[email protected]>
|
I believe I've updated the proposal to address all concerns at this time. Please correct me if I'm wrong - happy to review on tomorrow's release call. |
cicoyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 binding
|
+1 binding |
2 similar comments
|
+1 binding |
|
+1 binding |
I submit this proposal as a cohesive reimagining of the versioning approaches documented in #82 (type-based) and #92 (patching).
As I mentioned on today's release call, I believe there's meaningful value in offering both approaches to Dapr developers, as they cater to distinct but complementary use cases. However, if Dapr is to support both, it should do so through a unified, thoughtfully integrated model - one that avoids fragmentation and instead leverages the strengths of each approach in concert.
In this proposal, I've refined the type-based strategy from #82 to address concerns raised in the discussion of #92, while also dramatically simplifying the patching mechanism proposed in #92. The core philosophy is this: types represent a "hard fork" or clean break between workflow versions, ideal for major changes such as refactoring opportunities. While this can lead to an increase in type definitions, it's unnecessary for minor, non-breaking changes.
That's where patches come in. Patches are named, scoped to a specific versioned workflow type, and implemented similarly to Temporal's model - but with a key difference: instead of adding events to the workflow history, patch metadata is embedded in the version string. When a workflow calls
context.IsPatched(string name), the new code path is deterministically taken. I've removed the elaborate versioning scheme from #92 and replaced with a simpler binary model: each patch has only two branches - new vs old code. If no version is present (i.e., on first execution), the new code path is always taken. The old path is only executed during replay if it was taken in the original run.This approach significantly reduces cognitive overhead for developers, making is easier to reason about patch behavior and understand when either versioning strategy makes more sense. It naturally encourages the introduction of new workflow types when patching becomes too complex - providing a clean path to refactoring.
In short, this proposal unifies and improves upon #82 and #92 by:
If Dapr is going to support workflow versioning (and I wholeheartedly think it should), it should do so in a way that feels cohesive - not like two competing paradigms bolted together. This proposal aims to deliver that cohesion by maximizing the strengths of both approaches in a complementary, developer-friendly way.
I believe the level of work of combining the two in this way is not significantly greater than either individual approach and can be done in the same original timeline.