Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal to refactor with traits and trees #162

Open
ExpandingMan opened this issue Apr 7, 2022 · 4 comments
Open

proposal to refactor with traits and trees #162

ExpandingMan opened this issue Apr 7, 2022 · 4 comments

Comments

@ExpandingMan
Copy link
Contributor

I started #159 a while ago and since could mostly not decide what it should look like.

I've recently had some much better thoughts about all this stuff, and here's my proposal.

AbstractPath is a tree, always

The central problem we've had with S3 is that S3 does not understand that it is supposed to be a tree. In retrospect, this seems like a poor design choice on the part of S3. That's not to say there's something wrong with pure key-value stores, there isn't, but people are trying to copy S3 to and from trees all the time. This means that everything that interacts with S3 has to re-implement a tree on top of it.

Obviously S3 is not going to change, but I think this discussion supports the idea that there is no getting away from treating S3Path like a tree.

Therefore, FilePathsBase should NOT try to represent objects that are not trees, meaning that function such as isdir which distinguish leaves from non-leaf nodes on trees are mandatory and must return logically consistent results regardless of the AbstractPath type (though we should be able to mitigate their cost in the case of key-value stores).

The difference between DirectoriesExplicit and DirectoriesImplicit

In my proposed trait system paths either have directoriestype(::PathType) of DirectoriesExplicit() or DirectoriesImplicit(). From the above discussion we see that

  • DirectoriesExplicit: children (getting immediate child nodes) is cheap.
  • DirectoriesImplicit: leaves (getting descendant leaf nodes) is cheap.

Therefore FilePathsBase should not attempt to call children (readdir) directly on DirectoriesImplicit paths.

Implementation

The DirectoriesExplicit code can mostly stay as is, however, it needs to be re-optimized. Tree functions such as parent need to be made more efficient (I have done this already, but I may not have made a commit as of writing).

For DirectoriesImplicit, the missing piece is a function which constructs an explicit tree from leaves by parsing the path names of the leaves. Functions such as walkpath would then have to traverse the explicitly constructed tree.

I strongly suggest this package adopt AbstractTrees.jl so that it can take advantage of the generalizations and optimizations therein. This should eliminate the vast majority of non-parsing logic from FilePathsBase itself (i.e. the most fiddly functions in src/paths.jl).

I expect that doing this properly should also solve problems such as #145 .

Next steps

I'm willing (and eager, considering the current rather broken state of the ecosystem and the frustrating importance of AWSS3.jl) to carry this through to completion, I'm posting here because I'd like some feedback as I am not a maintainer. I would like to see this PR by Keno to AbstractTrees merged, tagged and fully documented first, so my next step is to help out with that.

Afterwards, I can turn my attention to FilePathsBase. I'm hopeful that everyone will be pretty enthusiastic about the changes I wind up making because at this point I'm pretty confident they will lead to drastic simplifications in both FilePathsBase.jl and AWSS3.jl.

cc @ericphanson

@c42f
Copy link

c42f commented May 19, 2022

Hi @ExpandingMan it seems to me that what you're working toward here and in #159 is similar to part of my work in JuliaComputing/DataSets.jl#45: I want a tree-like abstraction for data which can be used from Julia code, but which is independent of the underlying storage.

Over there I've got file and tree types which are a bare git-like tree-of-blobs abstraction data model, intended to abstract the storage layer. (ie, storage could be on the local disk, on S3, in a git repo, in a tar file, etc etc)

Unlike FileTrees.jl, DataSets.jl doesn't have an opinion on processing methodology: it just wants to be good at storing and retrieving data. I'm using AbstractTrees.jl, but also working on functions for mutation which are more applicable to data storage systems.

@ExpandingMan
Copy link
Contributor Author

Thanks, I really ought to take a look at it for DataSets.jl. It may well be the case that my biggest motivating use case (Parquet2.jl) is better off with DataSets.jl than FilePathsBase.jl. However, I still think it's important to address the problem of packages like AWSS3.jl. What is currently happening is that FilePathsBase makes a lot of assumptions that are only likely to be true on local file systems, and in so doing defers a lot of the tree logic to the file system itself. This is great for true file systems, but doesn't work very well for weird things like S3. Perhaps it's a mistake to conflate them: certainly I'd never even attempt it were it not for the fact that so much existing software already attempts to treat S3 like a file system.

One approach that we might consider is that maybe FilePathsBase is perfectly great as is, but you really only should use it for true file systems where latency is not a major issue and other systems like S3 need a completely different package, maybe DataSets.jl.

Btw, I just authored a huge overhaul of AbstractTrees.jl. We are going to be checking how bad the breaking changes are in the ecosystem before tagging (it will be 0.4). We'll be in touch if DataSets.jl requires more than very minor changes.

@c42f
Copy link

c42f commented May 26, 2022

maybe FilePathsBase is perfectly great as is, but you really only should use it for true file systems

I think this may be true.

In working on DataSets.jl I've come to the conclusion that absolute paths are massively overrated as a general abstraction. Relative paths, on the other hand, are great and I provide a system-independent relative path literal in DataSets.jl. I wish I could rely on FilePathsBase for this, but it appears that the p"" macro is system-dependent.

I think one of the main problems with path types is that they don't have an API which goes with them, unless you assume they're just filesystem paths. You can't open() a path in general; what abstraction should that return? I wrote more about this here: JuliaLang/Juleps#26 (comment)

@c42f
Copy link

c42f commented May 26, 2022

Parquet2.jl

By the way I love this package's design and I'm hoping to build a TabularDataSets.jl which plugs into DataSets.jl and relies on Parquet2.jl for serialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants