Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In Object Store, return version & etag on multipart put. #5443

Closed
ashtuchkin opened this issue Feb 28, 2024 · 7 comments · Fixed by #5500
Closed

In Object Store, return version & etag on multipart put. #5443

ashtuchkin opened this issue Feb 28, 2024 · 7 comments · Fixed by #5500
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@ashtuchkin
Copy link

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

To protect integrity of our uploads, we store object versions together with s3 file paths. This is currently easy for regular ObjectStore::put requests, but is not possible with multipart uploads when using ObjectStore::put_multipart. The current return values of the latter (MultipartId and AsyncWrite) do not allow for an easy return of values, so maybe it's possible to add this to WriteMultiPart (so that we don't reimplement it using just MultiPartStore/PutPart traits).

Describe the solution you'd like

  1. (Optional) Make it possible to construct a PutPart from a MultiPartStore. Currently when you do create_multipart, it just returns a multipart id. Maybe add a create_multipart_put_part(&self, path: &Path) -> Result<Box<dyn PutPart>>? This will enable using WriteMultiPart on any MultiPartStore from the client code. This is good because then users can tune WriteMultiPart parameters like concurrency and potentially part size.
  2. PutPart::complete should return Result<PutResult>
  3. WriteMultiPart should remember this PutResult and provide a method to take it after shutdown is polled to completion.

After a bit more analysis, it looks like PutPart and MultiPartStore are duplicating put_part and complete methods. Plus, it looks like you can create a generic PutPart implementation given just the MultiPartStore interface. What do you think about simplifying this? There are 2 options:

  1. Only keep create_multipart method in MultiPartStore, make it return PutPart. Add "abort" and "get_upload_id" methods to the PutPart. This way
  2. Keep MultiPartStore as-is, remove PutPart trait completely, make WriteMultiPart work with MultiPartStore directly. This will require adding location and upload_id to WriteMultiPart.

Describe alternatives you've considered

  • Reimplement WriteMultiPart logic with the changes above. There's a lot of code there though.
  • Issue a HEAD request right after the put. This will work in most cases, but is not guaranteed.

Additional context
I'm using version 0.9.

@ashtuchkin ashtuchkin added the enhancement Any new improvement worthy of a entry in the changelog label Feb 28, 2024
@tustvold
Copy link
Contributor

tustvold commented Feb 28, 2024

The reason PutPart and MultipartStore are separate is largely historical, PutPart came first. That being said, removing PutPart would be a breaking change and doesn't seem like it would materially reduce code, as the underlying logic is already shared.

R.e. returning PutPart from ObjectStore, this would prevent implementation for LocalFilesystem and other stores that don't have native multipart.

In the short-term the ETag could be obtained by performing a HEAD request afterwards, I will have a think about how we might support this better going forwards

@ashtuchkin
Copy link
Author

Thanks for the context!

R.e. returning PutPart from ObjectStore, this would prevent implementation for LocalFilesystem and other stores that don't have native multipart.

I was thinking PutPart could be returned from MultiPartStore trait, not ObjectStore.
Currently although PutPart trait is public, there are no public implementations, which forces users to implement it for all stores themselves if they want to use it. It would be convenient for either MultiPartStore to return a private store-specific implementation, or to create a generic struct that would wrap a dyn MultiPartStore+location+upload_id and implement PutPart. The latter can be done by users of course, but arguably it would be better to have it in the object store library itself.

If that first part is done, then just exposing PutResult on the WriteMultiPart struct would resolve this issue. Let me know if you'd like a PR for that, I can do it.

@tustvold
Copy link
Contributor

tustvold commented Feb 29, 2024

The major challenge with returning PutPart from MultipartStore, is that we need to preserve the ability to resume existing uploads identified by MultipartId (#4961).

I need to think a bit on it more but the following changes make sense to me:

  • Add a MultipartUpload that can be created from a MultipartStore and MultipartId and implements PutPart
  • Update PutPart::complete to return PutResult
  • Add a method to WriteMultiPart to retrieve the PutResult if the upload is finished

These are all relatively straightforward changes that shouldn't be too painful for downstreams

@ashtuchkin
Copy link
Author

ashtuchkin commented Feb 29, 2024 via email

@tustvold
Copy link
Contributor

tustvold commented Mar 5, 2024

See #5458 for a more holistic re-imagining of how we handle streaming uploads

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

For anyone following along, an API proposal is in this PR: #5500

@tustvold tustvold added the object-store Object Store Interface label Apr 17, 2024
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'object-store'} from #5500

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface
Projects
None yet
3 participants