-
Notifications
You must be signed in to change notification settings - Fork 39
Adds Workflow State Retention #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
20251108-RIS-workflow-state-ttl.md Signed-off-by: joshvanl <[email protected]>
Signed-off-by: joshvanl <[email protected]>
| ## Design | ||
|
|
||
| When scheduling a workflow, users will be able to configure some duration which upon elapsing after the workflow has reached a terminal state, the workflow will be purged from the actor state store. | ||
| The duration will only start once the workflow has reached either a TERMINATED, COMPLETED, or FAILED state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to support different TTL per state? Users might not want to keep COMPLETED workflows for too long, but might want to keep FAILED ones for longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense i think- only think is that it will balloon the options in SDKs a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we support setting a global ttl in daprd? so clients have to do nothing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So current thinking is having a spec.workflow.purgeTTL.{defaultTerminal,completed,failed,terminated} as time durations in the Dapr config.
Then the following optional proto message under the CreateInstanceRequest and ExecutionStartedEvent messages. The workflow runtime will pick the min duration which matches the terminal state of the workflow. Per workflow requests with a matching state will have preference over the Dapr config. If a specific terminal state, and defaultTerminal are defined, the specific terminal state will take precedence.
wdyt?
message InstacePurgeTTL {
optional google.protobuf.Duration defaultTerminal = 1;
optional google.protobuf.Duration completed = 2;
optional google.protobuf.Duration failed = 3;
optional google.protobuf.Duration terminated = 4;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect, imo even just the dapr configuration is sufficient, so no real need to modify the SDKs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having both, but happy to see either! My feeling is the more we can let users control in terms of behaviour in code, the better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just call it default because TTL purging is always at terminal states, right?
Regarding adding a configuration for this, how would this handle changes in the configuration for in-flight workflows? Like if a workflow started when the TTL was set at 60s but the configuration changes to 30s, which TTL will the workflow experience? I would assume it's 60s because it's 'saved' in the workflow, but it might feel confusing for users if they change the configuration but still see workflows being purged after 60s...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it would the configuration a the time at which the workflow reach that terminal state. It would be possible to see what the TTL is with dapr scheduler list --filter workflow-retention
Co-authored-by: Albert Callarisa <[email protected]> Signed-off-by: Josh van Leeuwen <[email protected]>
Signed-off-by: joshvanl <[email protected]>
Signed-off-by: joshvanl <[email protected]>
cicoyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 binding - just a few comments from me. Approving bc my comments are on smaller things, overall I agree with the path forward with this proposal
| spec: | ||
| workflow: | ||
| stateRetentionPolicy: | ||
| anyTerminal: "5s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it work if all are set in this example, like anyTerminal with all the rest below? Does the specific one, like completed override the anyTerminal setting?
| } | ||
|
|
||
| message InstanceStateRetentionPolicy { | ||
| optional google.protobuf.Duration allTerminal = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| optional google.protobuf.Duration allTerminal = 1; | |
| optional google.protobuf.Duration all_terminal = 1; |
| By using a new actor type, this feature is fully backwards compatible as older clients will not register for this new purge workflow type. | ||
|
|
||
| ``` | ||
| WORKFLOW COMPLETE -> orestrator -> create retentioner reminder -...> execute retentioner reminder -> execute retentioner actor -> execute purge on orchestrator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WORKFLOW COMPLETE -> orestrator -> create retentioner reminder -...> execute retentioner reminder -> execute retentioner actor -> execute purge on orchestrator | |
| WORKFLOW COMPLETE -> orchestrator -> create retentioner reminder -...> execute retentioner reminder -> execute retentioner actor -> execute purge on orchestrator -> cleanup retentioner actor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thats right?
passuied
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great addition! Thanks!
No description provided.