Adds Workflow State Retention#95
Conversation
20251108-RIS-workflow-state-ttl.md Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
| ## Design | ||
|
|
||
| When scheduling a workflow, users will be able to configure some duration which upon elapsing after the workflow has reached a terminal state, the workflow will be purged from the actor state store. | ||
| The duration will only start once the workflow has reached either a TERMINATED, COMPLETED, or FAILED state. |
There was a problem hiding this comment.
Would it make sense to support different TTL per state? Users might not want to keep COMPLETED workflows for too long, but might want to keep FAILED ones for longer.
There was a problem hiding this comment.
Yeah that makes sense i think- only think is that it will balloon the options in SDKs a bit
There was a problem hiding this comment.
should we support setting a global ttl in daprd? so clients have to do nothing?
There was a problem hiding this comment.
So current thinking is having a spec.workflow.purgeTTL.{defaultTerminal,completed,failed,terminated} as time durations in the Dapr config.
Then the following optional proto message under the CreateInstanceRequest and ExecutionStartedEvent messages. The workflow runtime will pick the min duration which matches the terminal state of the workflow. Per workflow requests with a matching state will have preference over the Dapr config. If a specific terminal state, and defaultTerminal are defined, the specific terminal state will take precedence.
wdyt?
message InstacePurgeTTL {
optional google.protobuf.Duration defaultTerminal = 1;
optional google.protobuf.Duration completed = 2;
optional google.protobuf.Duration failed = 3;
optional google.protobuf.Duration terminated = 4;
}There was a problem hiding this comment.
perfect, imo even just the dapr configuration is sufficient, so no real need to modify the SDKs
There was a problem hiding this comment.
I like having both, but happy to see either! My feeling is the more we can let users control in terms of behaviour in code, the better.
There was a problem hiding this comment.
I'd just call it default because TTL purging is always at terminal states, right?
Regarding adding a configuration for this, how would this handle changes in the configuration for in-flight workflows? Like if a workflow started when the TTL was set at 60s but the configuration changes to 30s, which TTL will the workflow experience? I would assume it's 60s because it's 'saved' in the workflow, but it might feel confusing for users if they change the configuration but still see workflows being purged after 60s...
There was a problem hiding this comment.
In this case it would the configuration a the time at which the workflow reach that terminal state. It would be possible to see what the TTL is with dapr scheduler list --filter workflow-retention
Co-authored-by: Albert Callarisa <albert@acroca.com> Signed-off-by: Josh van Leeuwen <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
cicoyle
left a comment
There was a problem hiding this comment.
+1 binding - just a few comments from me. Approving bc my comments are on smaller things, overall I agree with the path forward with this proposal
| spec: | ||
| workflow: | ||
| stateRetentionPolicy: | ||
| anyTerminal: "5s" |
There was a problem hiding this comment.
How does it work if all are set in this example, like anyTerminal with all the rest below? Does the specific one, like completed override the anyTerminal setting?
| } | ||
|
|
||
| message InstanceStateRetentionPolicy { | ||
| optional google.protobuf.Duration allTerminal = 1; |
There was a problem hiding this comment.
| optional google.protobuf.Duration allTerminal = 1; | |
| optional google.protobuf.Duration all_terminal = 1; |
| By using a new actor type, this feature is fully backwards compatible as older clients will not register for this new purge workflow type. | ||
|
|
||
| ``` | ||
| WORKFLOW COMPLETE -> orestrator -> create retentioner reminder -...> execute retentioner reminder -> execute retentioner actor -> execute purge on orchestrator |
There was a problem hiding this comment.
| WORKFLOW COMPLETE -> orestrator -> create retentioner reminder -...> execute retentioner reminder -> execute retentioner actor -> execute purge on orchestrator | |
| WORKFLOW COMPLETE -> orchestrator -> create retentioner reminder -...> execute retentioner reminder -> execute retentioner actor -> execute purge on orchestrator -> cleanup retentioner actor |
|
+1 |
1 similar comment
|
+1 |
No description provided.