Integration with object_store crate #3

Open
kylebarron opened this issue Feb 15, 2023 · 7 comments

@kylebarron

👋 Hi! I'm a fan of fsspec in python and happy to see you working on it in Rust as well.

I wanted to make sure you were aware of the object_store crate, because I see that as an existing implementation of fsspec in Rust, and you appear to be re-implementing fsspec from scratch here. Connecting that to python might save some work?

@martindurant
Owner

After a deeper look into the library than my original glance afforded, I agree that object_store seems to include everything that an rfsspec would need. I'll see what it takes to integrate (help appreciated!). Critically, we will want to allow for the full set of credential options currently supported by s3fs, gcsfs and adlfs. Of course, HTTP is the easiest.

Some thoughts:

  • http supports PUT but not POST
  • the clients seem to be per-bucket; do they share auth and connection resources?
  • has anyone tried using object_store via wasm in the browser? I would love to see that!

@kylebarron
Author

kylebarron commented Feb 21, 2023

I believe object_store is a relatively young library (and I haven't used it much myself), so it probably makes sense to start an issue over there (in the arrow-rs repo) to ask some of your questions. I've been wondering whether object_store considers it in-scope to include more filesystem apis like fsspec, so I'll write up an issue (hopefully today) and cc you.

  • has anyone tried using object_store via wasm in the browser? I would love to see that!

Yes! It's set up to be able to compile to wasm: https://github.com/apache/arrow-rs/tree/master/object_store#support-for-wasm32-unknown-unknown-target. As noted there, currently cloud integrations are turned off in wasm; not sure what the underlying complications are for those.

See also this PR: apache/arrow-rs#2896

@tustvold

tustvold commented Feb 21, 2023

👋 object_store maintainer here

I've been wondering whether object_store considers it in-scope to include more filesystem apis like fsspec

The object_store crate is focused on the APIs that object stores can efficiently provide, on the basis that this is what 99% of workloads actually need. Functionality such as directories, random access reads, etc... would therefore be considered out of scope.

That being said, I don't see a reason why object_store couldn't be used as the basis for an fsspec-style implementation, with the unavoidable caveat that treating object stores as a filesystem requires prefetching heuristics and generally does not yield the best experience. See apache/datafusion#2205 (comment) and apache/arrow-rs#1473 for more context on this if you're interested.
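
To make the prefetching caveat concrete, here is a minimal Python sketch of a file-like reader layered over ranged reads. It is not taken from object_store or fsspec: `ReadAheadFile`, `fetch_range`, and the `readahead` parameter are hypothetical names invented for this illustration, and an in-memory byte string stands in for a remote object.

```python
from typing import Callable


class ReadAheadFile:
    """File-like view over an object that only supports ranged reads.

    Hypothetical illustration: `fetch_range(start, end)` stands in for one
    request to an object store and returns the bytes in [start, end).
    """

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int,
                 readahead: int = 1024) -> None:
        self._fetch = fetch_range
        self._size = size
        self._readahead = readahead
        self._pos = 0
        self._buf_start = 0
        self._buf = b""

    def seek(self, pos: int) -> None:
        self._pos = max(0, min(pos, self._size))

    def read(self, n: int) -> bytes:
        end = min(self._pos + n, self._size)
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= self._pos and end <= buf_end):
            # Miss: fetch the requested bytes plus a guess at what comes next.
            fetch_end = min(end + self._readahead, self._size)
            self._buf = self._fetch(self._pos, fetch_end)
            self._buf_start = self._pos
        lo = self._pos - self._buf_start
        out = self._buf[lo:lo + (end - self._pos)]
        self._pos = end
        return out


data = bytes(range(256)) * 16  # 4096 bytes standing in for a remote object
requests = []


def fetch_range(start: int, end: int) -> bytes:
    requests.append((start, end))  # record each simulated request
    return data[start:end]


f = ReadAheadFile(fetch_range, size=len(data), readahead=1024)
chunks = [f.read(100) for _ in range(10)]  # sequential reads
f.seek(3000)
tail = f.read(100)                         # random jump
```

In this run the ten sequential reads are served by a single request, while the jump to offset 3000 triggers a second request that fetches 1096 bytes to satisfy a 100-byte read. The same heuristic that helps one access pattern wastes bandwidth on another, which is the trade-off described above.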

not sure what the underlying complications are for those.

I seem to remember some limitation of tokio's networking support; it was something at that level as opposed to something inherent to object_store itself.

@kylebarron
Author

... as I was in the midst of writing an object_store issue 🙂

unavoidable caveat that treating object stores as a filesystem requires prefetching heuristics

I 100% agree that this is very tricky (I'll save those issues for later reading!). In my eyes the style of fsspec is to expose this to the user for them to choose how they want to handle this. fsspec in Python includes a variety of caching mechanisms that the user can add as they wish: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/caching.py

random access reads, etc... would therefore be considered out of scope.

I'm particularly interested myself in random access reads with an underlying block cache. I think something like fsspec in Rust would be useful, and building on top of object_store would certainly make things easier.
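
As a rough illustration of random access reads backed by a block cache, the Python sketch below keeps fixed-size blocks in an LRU map and fetches a block only on a miss. It is not fsspec's or object_store's implementation: `LruBlockReader`, `fetch_range`, `block_size`, and `max_blocks` are hypothetical names chosen for this example.

```python
from collections import OrderedDict
from typing import Callable


class LruBlockReader:
    """Random-access reads served from an LRU cache of fixed-size blocks.

    Hypothetical illustration: `fetch_range(start, end)` stands in for one
    ranged request to an object store and returns the bytes in [start, end).
    """

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int,
                 block_size: int = 4096, max_blocks: int = 32) -> None:
        self._fetch = fetch_range
        self._size = size
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()

    def _block(self, index: int) -> bytes:
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        start = index * self._block_size
        end = min(start + self._block_size, self._size)
        block = self._fetch(start, end)
        self._blocks[index] = block
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)  # evict least recently used
        return block

    def read(self, start: int, end: int) -> bytes:
        start, end = max(start, 0), min(end, self._size)
        if start >= end:
            return b""
        first = start // self._block_size
        last = (end - 1) // self._block_size
        joined = b"".join(self._block(i) for i in range(first, last + 1))
        offset = start - first * self._block_size
        return joined[offset:offset + (end - start)]


data = bytes(i % 251 for i in range(10_000))  # stand-in for a remote object
requests = []


def fetch_range(start: int, end: int) -> bytes:
    requests.append((start, end))  # record each simulated request
    return data[start:end]


reader = LruBlockReader(fetch_range, len(data), block_size=1000, max_blocks=2)
a = reader.read(1500, 2500)    # misses on blocks 1 and 2
b = reader.read(1900, 2100)    # both blocks already cached
c = reader.read(9500, 10_000)  # miss on block 9, evicts block 1
d = reader.read(1000, 1010)    # block 1 has to be fetched again
```

With only two cached blocks, the last read shows the cost of eviction: block 1 is requested twice. Block size and capacity are workload-dependent choices, which is one reason to leave them to the user.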

@kylebarron
Author

@martindurant what are your plans for this repo? Do you want this to only be a Python integration? Are you interested in a public Rust filesystem API that is also integrated into Python? If you're focused mostly on speeding up fsspec in Python, a Rust filesystem API might be out of scope?

@tustvold

I 100% agree that this is very tricky (I'll save those issues for later reading!)

The TLDR is that workload-agnostic caching layers don't perform very well. Databricks built their own integrated S3 reader for Spark, and the Hadoop ecosystem is working on adding vectored IO that maps better to the underlying object store requests.
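
The idea behind vectored reads can be sketched in a few lines of Python: the caller hands over all the byte ranges it needs at once, so nearby ranges can be merged into fewer requests. This is only an illustration of the general technique, not the Hadoop or object_store implementation; `merge_ranges`, `read_ranges`, `fetch_range`, and `max_gap` are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def merge_ranges(ranges: Iterable[Range], max_gap: int = 0) -> List[Range]:
    """Merge ranges that overlap or sit at most `max_gap` bytes apart."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def read_ranges(fetch_range: Callable[[int, int], bytes],
                ranges: Iterable[Range], max_gap: int = 0) -> List[bytes]:
    """Return the bytes for each requested range, in the order requested."""
    ranges = list(ranges)
    fetched = [(s, e, fetch_range(s, e))
               for s, e in merge_ranges(ranges, max_gap)]
    results = []
    for start, end in ranges:
        for s, e, buf in fetched:
            if s <= start and end <= e:  # the merged range covering this one
                results.append(buf[start - s:end - s])
                break
    return results


data = bytes(range(200))  # stand-in for a remote object
requests = []


def fetch_range(start: int, end: int) -> bytes:
    requests.append((start, end))  # record each simulated request
    return data[start:end]


wanted = [(0, 10), (12, 20), (100, 110), (15, 30)]
parts = read_ranges(fetch_range, wanted, max_gap=4)
```

Here four requested ranges turn into two fetches, `(0, 30)` and `(100, 110)`. The reader can only do this because it sees the whole access pattern up front, which a generic cache behind a `read(offset, length)` call never does.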

@kylebarron
Author

Makes a lot of sense. I suppose it would be important to make such an fsspec caching layer an externally-implementable trait
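
One way to picture an externally implementable caching layer, sketched here in Python with `typing.Protocol` as a loose analogue of a Rust trait. Every name below (`CachePolicy`, `NoCache`, `WholeObjectCache`, `CachedObject`, `fetch_range`) is hypothetical and exists only for this example; nothing here reflects an API in fsspec, rfsspec, or object_store.

```python
from typing import Callable, Optional, Protocol

FetchRange = Callable[[int, int], bytes]  # returns the bytes in [start, end)


class CachePolicy(Protocol):
    """Interface that any third party could implement for its own workload."""

    def read(self, fetch: FetchRange, start: int, end: int) -> bytes: ...


class NoCache:
    """Pass every read straight through to the store."""

    def read(self, fetch: FetchRange, start: int, end: int) -> bytes:
        return fetch(start, end)


class WholeObjectCache:
    """Fetch the entire object on first use, then serve reads from memory."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._data: Optional[bytes] = None

    def read(self, fetch: FetchRange, start: int, end: int) -> bytes:
        if self._data is None:
            self._data = fetch(0, self._size)
        return self._data[start:end]


class CachedObject:
    """Reader that delegates every caching decision to the given policy."""

    def __init__(self, fetch: FetchRange, policy: CachePolicy) -> None:
        self._fetch = fetch
        self._policy = policy

    def read(self, start: int, end: int) -> bytes:
        return self._policy.read(self._fetch, start, end)


data = bytes(range(100))  # stand-in for a remote object
requests = []


def fetch_range(start: int, end: int) -> bytes:
    requests.append((start, end))  # record each simulated request
    return data[start:end]


plain = CachedObject(fetch_range, NoCache())
first = [plain.read(0, 10), plain.read(0, 10)]    # two requests
uncached_requests = len(requests)

cached = CachedObject(fetch_range, WholeObjectCache(len(data)))
second = [cached.read(0, 10), cached.read(50, 60)]  # one request in total
cached_requests = len(requests) - uncached_requests
```

Because the reader only depends on the `read` method, a workload-specific policy can be supplied from outside the library without changing the reader itself.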
