-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: dagster-obstore #27450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: dagster-obstore #27450
Conversation
|
@dpeng817 you might taking a look at this? It will simplify the ComputeLogManagers a lot, and won't require heavy dependencies. All these stores can create pre-signed urls so that provides the download link to the log files. I've tested it for Azure and S3 and it works. |
|
Hey @ion-elgreco - I'd like to understand better the pain you were originally facing that led you to making this integration. Obstore seems like a really awesome technology, but it's also new, and it's yet another python dependency. It scares me a bit to take a dependency on it in our aws, azure, and gcp integrations. Aws especially is super high traffic - and the idea of introducing a dependency on a relatively green package which might stop being updated at any time I think makes it an initial no-go to try and replace the existing functionality in any of azure, aws or gcp with this. I think to start we'll want to have this live in the community integrations repo, which I understand will make it more annoying to, say, deploy from the helm chart, but unfortunately might be the necessary tradeoff for now. To make that part of things less painful, perhaps we could expose helm chart config for custom compute log managers (IE you can specify module and config arbitrarily for a custom compute log manager). Thoughts about that approach? |
I can understand, but it's not entirely new. The rust crate "object store" is battle tested and production grade, used in various databases, query engines and so forth. The python binding is a quite minimal layer to this rust crate. No updates at no time, won't even be that big of an issue, blob apis are stable so in theory you could use the same version for years but outside of that, the maintainer of it is very active so I don't think this will happen. And it happens, you could fork it and manually bump the crate version to get the latest updates Tbh, having it be part of the community integrations would be more overhead to me with no benefit, since it will take longer to push changes when needed. My goal is rather to make this part of the core maintained libraries and have it as optional in the helm chart and already included in the docker images, the obstore wheel itself is only 4MB. |
ce0de79 to
eaf9a04
Compare
f5b1405 to
de3e06a
Compare
|
@dpeng817 @danielgafni can you guys let me know whether this can be part of Dagster? I've added integration tests for Azure and AWS at least Otherwise, I will release it manually to pypi since I don't want to spend too much time on this |
|
Actually I'll remove myself from reviewers as I'm going to be less involved with Dagster starting from this week. I think @ion-elgreco your contribution definitely is valuable, and I personally believe it could simplify Dagster's codebase and dependency tree. We just need to find the right place for it. I'll let others decide whether it should be an experimental submodule in If you were going to release it separately, consider moving it to the community repo. Projects in the community repo are pretty much self-contained. You won't experience any friction there. |
|
@danielgafni Ah ok that's good to know, in that case the community integrations are also fine. I did a small check but creating a docker image with this package and without dagster-aws/azure/gcp cuts the image to 50% of the size :) |
|
@dpeng817 @danielgafni could you retrigger the CI? |
|
@danielgafni can you run it again? Not sure why the pip install was failing, hope it works now |
bf0f3ed to
9b5184b
Compare
|
@danielgafni could you unblock it again? I found the issue, I had to install dagster-pipes as well as editable.. (: |
|
Looks like Does this mean we could just use it with all the current |
Yes!! |
|
If you're able to use the async API, it would likely be most performant not to use fsspec and to use the general obstore APIs directly |
|
Superseded by: dagster-io/community-integrations#107 |
Summary & Motivation
Obstore is a python binding for the Rust crate Object-store, much faster than boto3 and other equivalent implementations for interacting with cloud object stores.
This supports S3,Azure,GCS from a single library.
How I Tested These Changes
Would need some help here, for AWS I just copied the tests from dagster_aws. Is this fine?
I've manually tested the Azure and S3 integration and it works, logs show in the UI, and I can download the logs with presigned urls.
Changelog