Skip to content

Conversation

@ion-elgreco
Copy link
Contributor

@ion-elgreco ion-elgreco commented Jan 30, 2025

Summary & Motivation

Obstore is a python binding for the Rust crate Object-store, much faster than boto3 and other equivalent implementations for interacting with cloud object stores.

This supports S3,Azure,GCS from a single library.

How I Tested These Changes

Would need some help here, for AWS I just copied the tests from dagster_aws. Is this fine?

I've manually tested the Azure and S3 integration and it works, logs show in the UI, and I can download the logs with presigned urls.

Changelog

Insert changelog entry or delete this section.

@ion-elgreco
Copy link
Contributor Author

ion-elgreco commented Jan 30, 2025

@dpeng817 you might taking a look at this?

It will simplify the ComputeLogManagers a lot, and won't require heavy dependencies. All these stores can create pre-signed urls so that provides the download link to the log files.

I've tested it for Azure and S3 and it works.

@danielgafni danielgafni self-requested a review January 31, 2025 14:18
@dpeng817
Copy link
Contributor

Hey @ion-elgreco - I'd like to understand better the pain you were originally facing that led you to making this integration. Obstore seems like a really awesome technology, but it's also new, and it's yet another python dependency. It scares me a bit to take a dependency on it in our aws, azure, and gcp integrations. Aws especially is super high traffic - and the idea of introducing a dependency on a relatively green package which might stop being updated at any time I think makes it an initial no-go to try and replace the existing functionality in any of azure, aws or gcp with this.

I think to start we'll want to have this live in the community integrations repo, which I understand will make it more annoying to, say, deploy from the helm chart, but unfortunately might be the necessary tradeoff for now.

To make that part of things less painful, perhaps we could expose helm chart config for custom compute log managers (IE you can specify module and config arbitrarily for a custom compute log manager).

Thoughts about that approach?

@ion-elgreco
Copy link
Contributor Author

ion-elgreco commented Jan 31, 2025

Hey @ion-elgreco - I'd like to understand better the pain you were originally facing that led you to making this integration. Obstore seems like a really awesome technology, but it's also new, and it's yet another python dependency. It scares me a bit to take a dependency on it in our aws, azure, and gcp integrations. Aws especially is super high traffic - and the idea of introducing a dependency on a relatively green package which might stop being updated at any time I think makes it an initial no-go to try and replace the existing functionality in any of azure, aws or gcp with this.

I think to start we'll want to have this live in the community integrations repo, which I understand will make it more annoying to, say, deploy from the helm chart, but unfortunately might be the necessary tradeoff for now.

To make that part of things less painful, perhaps we could expose helm chart config for custom compute log managers (IE you can specify module and config arbitrarily for a custom compute log manager).

Thoughts about that approach?

I can understand, but it's not entirely new. The rust crate "object store" is battle tested and production grade, used in various databases, query engines and so forth. The python binding is a quite minimal layer to this rust crate.

No updates at no time, won't even be that big of an issue, blob apis are stable so in theory you could use the same version for years but outside of that, the maintainer of it is very active so I don't think this will happen. And it happens, you could fork it and manually bump the crate version to get the latest updates

Tbh, having it be part of the community integrations would be more overhead to me with no benefit, since it will take longer to push changes when needed.

My goal is rather to make this part of the core maintained libraries and have it as optional in the helm chart and already included in the docker images, the obstore wheel itself is only 4MB.

@ion-elgreco
Copy link
Contributor Author

@dpeng817 @danielgafni can you guys let me know whether this can be part of Dagster? I've added integration tests for Azure and AWS at least

Otherwise, I will release it manually to pypi since I don't want to spend too much time on this

@danielgafni danielgafni removed their request for review February 3, 2025 16:51
@danielgafni
Copy link
Contributor

danielgafni commented Feb 3, 2025

Actually I'll remove myself from reviewers as I'm going to be less involved with Dagster starting from this week.

I think @ion-elgreco your contribution definitely is valuable, and I personally believe it could simplify Dagster's codebase and dependency tree.

We just need to find the right place for it. I'll let others decide whether it should be an experimental submodule in dagster or a separate community integration. I think both could work.

If you were going to release it separately, consider moving it to the community repo. Projects in the community repo are pretty much self-contained. You won't experience any friction there.

@ion-elgreco
Copy link
Contributor Author

@danielgafni Ah ok that's good to know, in that case the community integrations are also fine.

I did a small check but creating a docker image with this package and without dagster-aws/azure/gcp cuts the image to 50% of the size :)

@ion-elgreco
Copy link
Contributor Author

@dpeng817 @danielgafni could you retrigger the CI?

@ion-elgreco ion-elgreco requested a review from neverett as a code owner February 4, 2025 11:11
@ion-elgreco
Copy link
Contributor Author

@danielgafni can you run it again? Not sure why the pip install was failing, hope it works now

@ion-elgreco
Copy link
Contributor Author

@danielgafni could you unblock it again? I found the issue, I had to install dagster-pipes as well as editable.. (:

@danielgafni
Copy link
Contributor

danielgafni commented Feb 4, 2025

Looks like obstore has an integration with fsspec.

Does this mean we could just use it with all the current fsspec/upath code?

@ion-elgreco
Copy link
Contributor Author

Looks like obstore has an integration with fsspec.

Does this mean we could just use it with all the current fsspec/upath code?

Yes!!

@kylebarron
Copy link

If you're able to use the async API, it would likely be most performant not to use fsspec and to use the general obstore APIs directly

@ion-elgreco
Copy link
Contributor Author

ion-elgreco commented Feb 5, 2025

Superseded by: dagster-io/community-integrations#107

@ion-elgreco ion-elgreco closed this Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants