Proposal: Binary Store Building Block #88
Signed-off-by: Whit Waldo <whit.waldo@innovian.net>
…a few details, removed an extraneous bullet and generally cleaned it up some
I'm massively in support, but how does this differ from the Object Store proposal? (Other than no support for metadata, anything else?)
There are a few differences:
Put more simply: those other stores anticipate the developer wanting to do both simple and far more advanced operations with their data. I'd certainly like to build more specialized data stores to accommodate such requirements, but this proposal seeks to do away with any complexity and do one thing really well: manage the reading, writing, and deletion of large files in a way that limits resource use and performs well, which is not possible with today's Dapr state management.
It seems like the existing S3 binding and the other existing bindings could be mapped. Have you tried a PoC?
@lindner The first step is proposing the shape of the block (as I've done here), soliciting public feedback on the API shape, and trying to discern whether anything else seems necessary within the described purpose of the API. Out of the box, I'd certainly like to target support for Azure Blob Storage and provide an S3-compatible component (as this would facilitate connectivity with S3 itself, but also with the many providers that offer S3-compatible APIs). The next steps are getting tentative maintainer sign-off (no point building a PoC if it's not going to be accepted) and then starting development. As I indicated in Discord, I intend to build this out as part of the next Dapr release (1.18).
In the context of storing large activity inputs and outputs in Workflows, I would strongly recommend that this design allow a workflow author to programmatically choose the path/directory of the binary file. This supports multi-tenant use cases where each tenant's data MUST be stored in a different location.
Having this location set at the time of scheduling the workflow (not registering the workflow) gives a good level of flexibility.

    // Proposed usage: BinaryStoreName and InputOutputBinaryStorePath are
    // suggested parameters, not part of the SDK today.
    builder.Services.AddDaprWorkflow(options =>
    {
        options.RegisterWorkflow<MyWorkflow>(BinaryStoreName: "my-binary-store");
    });
    // ...
    var tenantId = "tenant-a";
    var workflowId = "2c0882d7";
    await workflowClient.ScheduleNewWorkflowAsync(
        name: nameof(MyWorkflow),
        instanceId: workflowId,
        input: orderInfo,
        InputOutputBinaryStorePath: $"/store/{tenantId}/wf/{workflowId}"
    );

In the example above, assuming we're using an S3 Binary Store, the Activity input / output blobs would be stored in the following location
There is an assumption that workflows have an implicit Activity Id which uniquely identifies each activity call. We use that Activity Id in the path above. Building on the above example, the Reference to the blob becomes
The Reference, rather than the blob contents, is what is encoded in the Workflow History. The SDK can then dereference the data whenever the user demands it throughout the workflow. It may even be the case that the data is never dereferenced until the end of the workflow, when someone requests the output of the completed workflow, which may be one (or more) large blobs!
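To make the "store a Reference, dereference on demand" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `BlobReference`, `LazyBlob`, and the `fetch` delegate are hypothetical names, not part of any Dapr SDK, and the real Reference format is not settled in this thread.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical reference type: only these two small strings would be
// recorded in the workflow history, never the blob contents.
public sealed record BlobReference(string StoreName, string Path);

// Wraps a reference and fetches the bytes at most once, on first demand.
public sealed class LazyBlob
{
    private readonly Lazy<Task<byte[]>> _content;

    public LazyBlob(BlobReference reference, Func<BlobReference, Task<byte[]>> fetch)
    {
        Reference = reference;
        // The fetch delegate stands in for a call to the binary store;
        // it does not run until GetContentAsync is first called.
        _content = new Lazy<Task<byte[]>>(() => fetch(reference));
    }

    public BlobReference Reference { get; }

    // True once a fetch has been started (it may not have completed yet).
    public bool WasRequested => _content.IsValueCreated;

    public Task<byte[]> GetContentAsync() => _content.Value;
}
```

If the workflow never calls `GetContentAsync`, the blob is never read, which matches the case where the data is only needed once the workflow has completed.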
Might this instead be done more like how actors currently store state in key/value stores? Set a path on the component at registration time that's used as the root, and defer to the workflow to pick an appropriate path, relative to the registration path, to save to. Presumably the runtime would pick a path referencing the workflow ID and any namespace values itself, so the user needn't figure out how to specify their own paths?
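As a rough illustration of that alternative, the runtime could join a configured root with segments it chooses itself. This is only a sketch: `BlobPaths.Combine` is a hypothetical helper, and which segments the runtime would actually use (namespace, workflow ID, something else) is an open question here.

```csharp
using System;
using System.Linq;

public static class BlobPaths
{
    // Joins a configured root with runtime-chosen segments using '/'
    // separators, dropping empty segments and stray leading/trailing slashes.
    public static string Combine(string root, params string[] segments)
    {
        var parts = new[] { root }
            .Concat(segments)
            .Select(s => s.Trim('/'))
            .Where(s => s.Length > 0);
        return "/" + string.Join("/", parts);
    }
}
```

For example, `BlobPaths.Combine("/store", "default", "2c0882d7")` yields `/store/default/2c0882d7`, where `default` is an assumed namespace value and `2c0882d7` is the workflow ID from the earlier example.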

Increasingly, while writing applications that use Dapr, I keep running into the need to persist data that's too large to reasonably store using Dapr: often because it would exhaust the memory resources of the sidecar, and frequently because it's simply too large to store in a key/value store.
It doesn't make a ton of sense to rely exclusively on bindings for this, since a binding really just provides a Dapr-hosted alternative to the provider's SDK for something that we should increasingly have broad provider support for. "Object store" and "blob store" are overloaded terms that mean all manner of things depending on the provider. I think there's a fine opportunity to tackle that in the future, but this proposal isn't that.
Here, I propose an API devoid of List and even Metadata operations so that it can accommodate the broadest possible range of storage providers. Instead, I suggest that we increasingly lean on the SDKs to provide the state management rather than putting all that weight on the runtime and the components. It's a slim implementation that should be pretty easy to add, but it would provide immediate benefits for popular Dapr features: Workflows and the new Agentic operations come to mind, and it would be beneficial for Actor and Cryptographic operations as well.
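To give a feel for how small such a surface could be, here is a sketch of a three-operation contract. The names `IBinaryStore` and `InMemoryBinaryStore`, the method signatures, and the use of `Stream` are all assumptions made for illustration; the actual API shape is what this proposal is asking for feedback on. The in-memory class is only a stand-in for a real provider component.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical contract: write, read, and delete only. No List, no Metadata.
public interface IBinaryStore
{
    Task WriteAsync(string key, Stream content, CancellationToken ct = default);
    Task<Stream> ReadAsync(string key, CancellationToken ct = default);
    Task DeleteAsync(string key, CancellationToken ct = default);
}

// In-memory stand-in for a real provider, for illustration only.
public sealed class InMemoryBinaryStore : IBinaryStore
{
    private readonly ConcurrentDictionary<string, byte[]> _blobs = new();

    public async Task WriteAsync(string key, Stream content, CancellationToken ct = default)
    {
        // A real component would stream to the provider instead of buffering.
        using var buffer = new MemoryStream();
        await content.CopyToAsync(buffer, ct);
        _blobs[key] = buffer.ToArray();
    }

    public Task<Stream> ReadAsync(string key, CancellationToken ct = default)
    {
        if (!_blobs.TryGetValue(key, out var bytes))
            throw new KeyNotFoundException($"No blob stored under '{key}'.");
        return Task.FromResult<Stream>(new MemoryStream(bytes, writable: false));
    }

    public Task DeleteAsync(string key, CancellationToken ct = default)
    {
        // Deleting a missing key is treated as a no-op in this sketch.
        _blobs.TryRemove(key, out _);
        return Task.CompletedTask;
    }
}
```

Passing streams rather than byte arrays is one way to keep large payloads from being held in memory all at once, which is the resource concern described above.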
I look forward to your feedback!