162 changes: 162 additions & 0 deletions 20251108-RIS-workflow-purge-ttl.md
# Workflow: History Purge TTL on Completion

* Author(s): @joshvanl

## Overview

This proposal adds functionality to the workflow runtime that lets users delete completed workflow state from the actor state store after a configured time.
Each workflow instance may be configured with its own TTL at workflow scheduling time.
By default, workflow state is _not_ deleted from the actor state store and remains there indefinitely.

## Background

Today, to delete old workflow state from the actor state store, users must either call the Purge Workflow API or delete the state from the database directly, using database operations outside of Dapr or a first-class TTL feature of that database.
Users typically want old workflow state deleted some period of time after the workflow has reached a terminal state.

https://github.com/dapr/dapr/issues/9020

## Design

When scheduling a workflow, users will be able to configure a duration. Once that duration has elapsed after the workflow reaches a terminal state, the workflow will be purged from the actor state store.
The duration only starts counting once the workflow has reached a TERMINATED, COMPLETED, or FAILED state.
Member

Would it make sense to support different TTL per state? Users might not want to keep COMPLETED workflows for too long, but might want to keep FAILED ones for longer.

Contributor Author

Yeah that makes sense I think - only thing is that it will balloon the options in SDKs a bit

should we support setting a global ttl in daprd? so clients have to do nothing?

Contributor Author

@JoshVanL JoshVanL Nov 11, 2025

So current thinking is having a spec.workflow.purgeTTL.{defaultTerminal,completed,failed,terminated} as time durations in the Dapr config.

Then the following optional proto message under the CreateInstanceRequest and ExecutionStartedEvent messages. The workflow runtime will pick the min duration which matches the terminal state of the workflow. Per workflow requests with a matching state will have preference over the Dapr config. If a specific terminal state, and defaultTerminal are defined, the specific terminal state will take precedence.
wdyt?

message InstancePurgeTTL {
  optional google.protobuf.Duration defaultTerminal = 1;
  optional google.protobuf.Duration completed = 2;
  optional google.protobuf.Duration failed = 3;
  optional google.protobuf.Duration terminated = 4;
}

perfect, imo even just the dapr configuration is sufficient, so no real need to modify the SDKs

I like having both, but happy to see either! My feeling is the more we can let users control in terms of behaviour in code, the better.

Member

I'd just call it default because TTL purging is always at terminal states, right?

Regarding adding a configuration for this, how would this handle changes in the configuration for in-flight workflows? Like if a workflow started when the TTL was set at 60s but the configuration changes to 30s, which TTL will the workflow experience? I would assume it's 60s because it's 'saved' in the workflow, but it might feel confusing for users if they change the configuration but still see workflows being purged after 60s...

Contributor Author

In this case it would be the configuration at the time at which the workflow reaches that terminal state. It would be possible to see what the TTL is with `dapr scheduler list --filter workflow-retention`


Any duration may be given, e.g. days, weeks, or years.
A duration of `0` may also be given to delete the workflow actor state immediately after the workflow reaches a terminal state.
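
The precedence discussed in the review thread above could be resolved along the lines of the sketch below. It is only one possible reading of that discussion: the `PurgeTTL` struct, the `resolve` function, and the exact ordering (per-workflow request before Dapr configuration, state-specific value before the default) are assumptions for illustration, not part of this proposal's final API. A `nil` result stands for "never purge".

```go
package main

import (
	"fmt"
	"time"
)

// PurgeTTL is a hypothetical Go mirror of the message sketched in the review
// thread. A nil field means "not set".
type PurgeTTL struct {
	Default    *time.Duration
	Completed  *time.Duration
	Failed     *time.Duration
	Terminated *time.Duration
}

// Status is a simplified stand-in for the terminal workflow states.
type Status string

const (
	Completed  Status = "COMPLETED"
	Failed     Status = "FAILED"
	Terminated Status = "TERMINATED"
)

// forStatus returns the state-specific TTL if set, otherwise the default.
func (p *PurgeTTL) forStatus(s Status) *time.Duration {
	if p == nil {
		return nil
	}
	var specific *time.Duration
	switch s {
	case Completed:
		specific = p.Completed
	case Failed:
		specific = p.Failed
	case Terminated:
		specific = p.Terminated
	}
	if specific != nil {
		return specific
	}
	return p.Default
}

// resolve prefers the per-workflow request over the Dapr configuration.
// This ordering is an assumption. A nil result means the state is kept.
func resolve(request, config *PurgeTTL, s Status) *time.Duration {
	if d := request.forStatus(s); d != nil {
		return d
	}
	return config.forStatus(s)
}

// ptr is a small helper for building optional durations.
func ptr(d time.Duration) *time.Duration { return &d }

func main() {
	config := &PurgeTTL{Default: ptr(24 * time.Hour), Failed: ptr(7 * 24 * time.Hour)}
	request := &PurgeTTL{Completed: ptr(0)}

	for _, s := range []Status{Completed, Failed, Terminated} {
		if d := resolve(request, config, s); d != nil {
			fmt.Printf("%s: purge after %s\n", s, *d)
		} else {
			fmt.Printf("%s: keep indefinitely\n", s)
		}
	}
}
```

Using pointers keeps "not set" distinct from an explicit `0`, which matters here because `0` means "purge immediately" while an unset value means "keep indefinitely".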

### Usage

#### CLI

Users can give a Go style duration string when running a workflow from the CLI.

```bash
$ dapr run my-workflow --purge-ttl=120h
```

```bash
$ dapr run my-workflow --purge-ttl=0s
```
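
For reference, Go's standard `time.ParseDuration` accepts units up to hours (`ns`, `us`, `ms`, `s`, `m`, `h`) and has no day unit, so five days is written as `120h` in that format. The sketch below shows how a flag value could be validated with the standard library alone. The `parsePurgeTTL` helper and the rejection of negative values are assumptions for illustration. Whether the CLI would also accept other duration formats is not specified here.

```go
package main

import (
	"fmt"
	"time"
)

// parsePurgeTTL is a hypothetical helper that validates a --purge-ttl value
// using only Go's standard duration syntax.
func parsePurgeTTL(s string) (time.Duration, error) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid purge TTL %q: %w", s, err)
	}
	// Assumption: a negative TTL has no sensible meaning, so reject it.
	if d < 0 {
		return 0, fmt.Errorf("purge TTL must not be negative, got %s", d)
	}
	return d, nil
}

func main() {
	for _, in := range []string{"120h", "0s", "5d"} {
		d, err := parsePurgeTTL(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", in, d)
	}
}
```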

The new purge reminders will be displayed like:

```bash
$ dapr scheduler list
NAME                        BEGIN  COUNT  LAST TRIGGER
purge-workflow/my-workflow  96h    0
```

#### Go

```go
wf.ScheduleWorkflow(ctx, "my-workflow", workflow.WithPurgeTTL(time.Hour*24*5))
```

```go
wf.ScheduleWorkflow(ctx, "my-workflow", workflow.WithPurgeTTL(0))
```

#### Python

```python
wfClient.schedule_new_workflow(workflow=my_workflow, purge_ttl=timedelta(days=5))
```

```python
wfClient.schedule_new_workflow(workflow=my_workflow, purge_ttl=timedelta(seconds=0))
```

#### JavaScript

```js
workflowClient.scheduleNewWorkflow({workflow: MyWorkflow, purge_ttl: Temporal.Duration.from({days: 5})})
```

```js
workflowClient.scheduleNewWorkflow({workflow: MyWorkflow, purge_ttl: Temporal.Duration.from({seconds: 0})})
```

#### .NET

```dotnet
workflowClient.ScheduleNewWorkflowAsync(
    name: nameof(MyWorkflow),
    stateTTL: TimeSpan.FromDays(5)
);
```

```dotnet
workflowClient.ScheduleNewWorkflowAsync(
    name: nameof(MyWorkflow),
    stateTTL: TimeSpan.FromSeconds(0)
);
```

#### Java

```java
opts.setPurgeTTL(Duration.ofDays(5));
workflowClient.scheduleNewWorkflow(OrderProcessingWorkflow.class, opts);
```

```java
opts.setPurgeTTL(Duration.ofSeconds(0));
workflowClient.scheduleNewWorkflow(OrderProcessingWorkflow.class, opts);
```

### Runtime

#### protos

The following protos will be updated with the new purge TTL duration field so it is piped from workflow creation to execution.

The new option will be added to `CreateInstanceRequest`, populated by the client.

```proto
message CreateInstanceRequest {
  string instanceId = 1;
  string name = 2;
  // EXISTING
  google.protobuf.Duration purgeTTL = 10; // NEW
}
```

`ExecutionStartedEvent` will carry the TTL duration, i.e. how long after the workflow reaches a terminal state its state should be purged.
This field will be persisted in the history log.
This field will be populated by the durabletask backend executor, which pipes it through from `CreateInstanceRequest`.

```proto
message ExecutionStartedEvent {
  string name = 1;
  // EXISTING
  google.protobuf.Duration purgeTTL = 10; // NEW
}
```
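
As a rough illustration of the piping described above, the sketch below copies the TTL from the create request onto the execution started event. The Go structs are simplified, hypothetical stand-ins for the generated proto types, and `newExecutionStartedEvent` and `durationPtr` are invented names. The point it tries to show is that an unset TTL (never purge) stays distinct from an explicit `0` (purge immediately).

```go
package main

import (
	"fmt"
	"time"
)

// CreateInstanceRequest is a simplified, hypothetical stand-in for the
// generated proto type. A nil PurgeTTL means the field was not set.
type CreateInstanceRequest struct {
	InstanceID string
	Name       string
	PurgeTTL   *time.Duration
}

// ExecutionStartedEvent is a simplified, hypothetical stand-in as well.
type ExecutionStartedEvent struct {
	Name     string
	PurgeTTL *time.Duration
}

// newExecutionStartedEvent pipes the TTL from the request to the event that
// is persisted in the history log.
func newExecutionStartedEvent(req *CreateInstanceRequest) *ExecutionStartedEvent {
	ev := &ExecutionStartedEvent{Name: req.Name}
	if req.PurgeTTL != nil {
		ttl := *req.PurgeTTL // copy so the event does not alias the request
		ev.PurgeTTL = &ttl
	}
	return ev
}

// durationPtr is a small helper for building optional durations.
func durationPtr(d time.Duration) *time.Duration { return &d }

func main() {
	withTTL := newExecutionStartedEvent(&CreateInstanceRequest{
		InstanceID: "abc",
		Name:       "my-workflow",
		PurgeTTL:   durationPtr(120 * time.Hour),
	})
	without := newExecutionStartedEvent(&CreateInstanceRequest{
		InstanceID: "def",
		Name:       "my-workflow",
	})

	fmt.Println("with TTL:", *withTTL.PurgeTTL)
	fmt.Println("without TTL set:", without.PurgeTTL == nil)
}
```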

#### Actors

When a workflow reaches a terminal state, and after the orchestration actor has written the result to the actor state store, the orchestration actor creates an actor reminder if the `purgeTTL` field is present in the execution started event.

This reminder targets a new workflow actor type, and the reminder name is the instance ID of the workflow.

The new actor type will follow convention and have the following form:

```
dapr.internal.<namespace>.<app-id>.purge-workflow
```

When the reminder fires, the new purge actor is activated, calls the purge API on the workflow orchestrator actor for the given instance ID, and then deactivates itself.
Like the other workflow actor types, this type is registered when a workflow worker client connects and unregistered when it disconnects.

By using a new actor type, this feature is fully backwards compatible as older clients will not register for this new purge workflow type.


```
WORKFLOW COMPLETE -> orchestrator -> create purge reminder -...> execute purge reminder -> execute purge actor -> execute purge on orchestrator
```
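
The reminder creation step could look roughly like the sketch below. The `Reminder` struct and the function names are hypothetical and do not reflect the runtime's real actor APIs. Using the instance ID as the actor ID is also an assumption, since the text above only fixes the actor type format and the reminder name.

```go
package main

import (
	"fmt"
	"time"
)

// Reminder is a hypothetical, simplified description of an actor reminder.
type Reminder struct {
	ActorType string
	ActorID   string
	Name      string
	DueTime   time.Duration
}

// purgeActorType builds the actor type name in the form given above.
func purgeActorType(namespace, appID string) string {
	return fmt.Sprintf("dapr.internal.%s.%s.purge-workflow", namespace, appID)
}

// purgeReminder returns the reminder to create once a workflow has reached a
// terminal state. The boolean is false when no purgeTTL was recorded in the
// execution started event, in which case nothing is scheduled.
func purgeReminder(namespace, appID, instanceID string, purgeTTL *time.Duration) (Reminder, bool) {
	if purgeTTL == nil {
		return Reminder{}, false
	}
	return Reminder{
		ActorType: purgeActorType(namespace, appID),
		ActorID:   instanceID, // assumption: one purge actor per workflow instance
		Name:      instanceID, // reminder name is the workflow instance ID
		DueTime:   *purgeTTL,  // a TTL of 0 is due immediately
	}, true
}

// ttlPtr is a small helper for building optional durations.
func ttlPtr(d time.Duration) *time.Duration { return &d }

func main() {
	if r, ok := purgeReminder("default", "my-app", "my-workflow", ttlPtr(120*time.Hour)); ok {
		fmt.Printf("create reminder %q on %s, due in %s\n", r.Name, r.ActorType, r.DueTime)
	}
	if _, ok := purgeReminder("default", "my-app", "my-workflow", nil); !ok {
		fmt.Println("no purgeTTL set, no reminder created")
	}
}
```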

## Alternatives

Another option is to use the actor state store TTL functionality to delete stored keys based on individual key TTLs.
This is not appropriate, as workflow data _must_ only be deleted from the state store once the workflow has reached a terminal state.
Deleting it earlier would corrupt workflow processing.
It is therefore necessary to delete the stored data through the Purge API, which processes the request inside the same workflow state machine.
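
The difference from a key-level TTL can be illustrated with a toy state machine in which purge is refused until a terminal state has been reached. The types and names below are hypothetical and greatly simplified compared to the real orchestrator actor.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a simplified stand-in for the workflow runtime status.
type Status int

const (
	Running Status = iota
	Completed
	Failed
	Terminated
)

// terminal reports whether the workflow can no longer make progress.
func (s Status) terminal() bool {
	return s == Completed || s == Failed || s == Terminated
}

var errNotTerminal = errors.New("workflow has not reached a terminal state")

// workflow is a toy model of stored workflow state.
type workflow struct {
	status  Status
	history []string
}

// purge deletes the stored history, but only from within the state machine
// and only after a terminal state. A store-level key TTL has no such guard.
func (w *workflow) purge() error {
	if !w.status.terminal() {
		return errNotTerminal
	}
	w.history = nil
	return nil
}

func main() {
	w := &workflow{status: Running, history: []string{"ExecutionStarted"}}
	fmt.Println("purge while running:", w.purge())

	w.status = Completed
	fmt.Println("purge after completion:", w.purge(), "remaining events:", len(w.history))
}
```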