-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a DAG walker with support for IPLD Node
s
#39
Conversation
@Stebalien Could you take a look at this please? Particularly the node promise logic that was extracted from the DAG reader. |
@Mr0grog Could you take a second look at the DAG walker please? This is a spin-off from the initial PR you already reviewed, I have incorporated many of your suggestions and also have split the logic adding a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look and put down a few thoughts inline.
// the `Walker` responsibility. | ||
|
||
// If we drop to <= preloadSize/2 preloading nodes, preload the next 10. | ||
for i := childIndex; i < childIndex+preloadSize/2 && i < uint(len(nn.childPromises)); i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I’m reading right, it seems like it would be easy to wind up preloading a too many nodes by calling FetchChild()
out of order. For example, if I had a node with 30 children and called FetchChild(ctx, 20)
followed by FetchChild(ctx, 10)
, this check would act as if nothing was preloading at all and start loading children 10-19, even though it’s already loading children 20-29.
Not sure how big a concern that is right now, but if the eventual future of Walker
is async, it doesn’t seem like we should expect ordering to be reliable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this logic was inherited from the DAG reader and I need @Stebalien to check it.
navipld.go
Outdated
childPromises []*NodePromise | ||
// TODO: Consider encapsulating it in a single structure alongside `childCIDs`. | ||
|
||
serv NodeGetter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s not public, so not that big a deal, but I don’t feel like this abbreviation is a very clear name. What about getter
?
walker.go
Outdated
// | ||
// TODO: The search could be extended to search from the current position. | ||
// (Is there something in the logic that would prevent it at the moment?) | ||
func (w *Walker) Search(visitor Visitor) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I’d say the same thing as I did before about this name — Seek()
sounds a lot clearer to me because what you’re really doing here is navigating a direct path to a leaf. It only works as a search if the the field you are searching on is indexed in every node of the tree. For example, you can search for a leaf by byte offset in a UnixFS tree with this, but if I want to search for the text “hello” in a text file in a UnixFS tree, this function won’t be useful.
Second, since this still calls the visitor on the leaf node, I feel like I’m back in the position where it doesn’t feel like this offers much over just calling Iterate()
like I described here: ipfs/kubo#5257 (comment) The only difference between what you’d write with Iterate()
vs. Search()
is that you’d have to call Pause()
before returning on a leaf (and you’d already be having to check whether you’re on a leaf anyway). The logic in your visitor would be the same in both cases. Why not make this function do a little more?
func (w *Walker) Seek(visitor Visitor) (NavigableNode, error) {
err := w.Iterate(func(node NavigableNode) error {
// If we’ve arrived at a leaf, we’re done! \o/
if node.ChildTotal() == 0 {
return errPauseWalkOperation
}
err := visitor(node)
if err == ErrNextNoChild || w.ActiveChildIndex() >= w.ActiveNode().ChildTotal() {
// You probably want a special, public error for this case, but this
// reproduces what the current commit does.
return errPauseWalkOperation
}
return err
})
return w.ActiveNode(), err
}
That way, the visitor you use with Search()
(or Seek()
or whatever) can have simpler logic — it only has to navigate and not worry about processing the final node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, you can search for a leaf by byte offset in a UnixFS tree with this, but if I want to search for the text “hello” in a text file in a UnixFS tree, this function won’t be useful.
I'll rename it to Seek
if you think it's more appropriate, but just to clarify, this doesn't operate on the UnixFS layer (or any particular IPFS layer for that matter), it operates on the more abstract concept of a DAG, any DAG (not just the MerkleDAG). What you search/seek for depends on what those generic (not necessarily IPLD) nodes encode, it might be an offset but it might as well be a string, it's the consumer of the walker who defines what information is encoded and how it's processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second, since this still calls the visitor on the leaf node, I feel like I’m back in the position where it doesn’t feel like this offers much over just calling
Iterate()
like I described here: ipfs/go-ipfs#5257 (comment)
Yes, and I'm still giving more importance to what you (very well) described:
But you’re right; if we changed the API to Search, we could do something like this, that might be conceptually clearer (at the cost of more error handling code):
But you have a good point, I'm just prioritizing the concept over the code that lies inside the walker (which the user should not be forced to study to use it).
The only difference between what you’d write with
Iterate()
vs.Search()
is that you’d have to callPause()
before returning on a leaf (and you’d already be having to check whether you’re on a leaf anyway).
I'm not sure I understand this part, the advantage with Search()
is that the check for a leaf node (i.e., nowhere to go from here) is done inside the walker method so the user doesn't need to worry about it. In contrast, using Iterate()
to perform a search/seek lays that burden on the user (granted, small burden, but still if I can avoid it without making too much of a mess inside the walker I think it's worth it).
I think that what may be confusing is the way the DAG reader uses the walker when seeking inside a file. It does check for leaf nodes, but not because of the DAG seek operation itself but because of the way UnixFS works, which encodes information in the leaf nodes differently than in internal nodes. But those two are decoupled concepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to clarify, this doesn't operate on the UnixFS layer (or any particular IPFS layer for that matter)
Yep, that was clear. I guess I could have said “search for some arbitrary string of bytes” instead 😉
I'm not sure I understand this part, the advantage with
Search()
is that the check for a leaf node (i.e., nowhere to go from here) is done inside the walker method
Maybe I’m misreading, then. The code looks to me like the user-supplied Visitor
function will be called with the leaf. It just keeps going down until there’s no child to enter. It seems like it’s still up to the user to determine whether their visitor should be path-seeking or consuming the found node on any given call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks to me like the user-supplied
Visitor
function will be called with the leaf. It just keeps going down until there’s no child to enter.
Yes.
It seems like it’s still up to the user to determine whether their visitor should be path-seeking or consuming the found node on any given call.
Could you expand on that? What does it mean to be path-seeking? The way I thought of it (but this may not be the actual use case) is that you're presented with the choice on every bifurcation, either you found your node and stop or you use the current node to steer the seek/search down. There's no need to worry about the special leaf node case, the search would just stop on its own and you'd end up with nothing (no node found). If using Iterate
the user must supply the "is this the leaf node" case, and then issue another pause to stop the search (but not because they found what they we're looking for but because there's nowhere to go).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean to be path-seeking?
Running logic that determines where to go in the DAG (i.e. determining whether and how many times to call walker.NextChild()
).
The way I thought of it (but this may not be the actual use case) is that you're presented with the choice on every bifurcation, either you found your node and stop or you use the current node to steer the seek/search down.
I guess what I’m stuck on is two things:
-
If you are on a leaf, you are by definition already done. Why should the visitor have to deal with that leaf?
-
We know it’s common to have leaves that are differently formatted than intermediate nodes (UnixFS is a great example). If that’s true, then your visitor has to be more complex:
err = walker.Seek(func(node NavigableNode) error { // Figure out how to treat this node if node.ChildTotal() > 0 { // Do some seeking: determine whether to skip children, etc. // call `walker.NextChild()` as needed } else { // Handle the found leaf node node // Even if I don't care about this case and don't need the `else` // here, I still need the `if` above } return nil })
But an implementation that never sends the leaf to the Visitor lets the Visitor be much simpler and only worry about one thing:
err = walker.Seek(func(node NavigableNode) error { // Do some seeking: determine whether to skip children, etc. // call `walker.NextChild()` as needed return nil })
If using
Iterate
the user must supply the "is this the leaf node" case, and then issue another pause to stop the search
So my concern above is that, with this Seek
implementation, they still have to to supply the “is this the leaf node” case. It feels to me like, if the only difference is that I don’t have to explicitly call Pause()
, I haven’t gained much. To bring back a [slightly modified] comparison from the original comment on the other PR:
err = walker.Seek(func(node NavigableNode) error {
if node.ChildTotal() > 0 {
// Do some seeking: determine whether to skip children, etc.
// call `walker.NextChild()` as needed
}
return nil
})
// Achieving the same functionality with Iterate:
err = walker.Iterate(func(node NavigableNode) error {
if node.ChildTotal() > 0 {
// Do some seeking: determine whether to skip children, etc.
// call `walker.NextChild()` as needed
} else {
// This line is literally the only difference
walker.Pause()
}
return nil
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my concern above is that, with this
Seek
implementation, they still have to to supply the “is this the leaf node” case. It feels to me like, if the only difference is that I don’t have to explicitly callPause()
, I haven’t gained much.
(The next explanation is not that important, I'm fine going either way with this API).
Those are two different uses of the check node.ChildTotal() > 0
. UnixFS (and probably many more) use a special node type for the leaf and the implementation will need to check for that, but (ideally) that should be handled in the UnixFS implementation and not here (e.g., UnixFS should provide a function to tell if the current node contains the target offset, is lower or higher).
For example, I had to introduce a function in UnixFS to extract file data independently of the node type so it becomes more transparent to the consumer and that way the Visitor
passed to Iterate
in the DAG read method doesn't need to care about the leaf node check (I added it anyway as a performance improvement but it might as well not be there).
I'm not sure if we'll introduce a similar function to seek offsets and avoid the node.ChildTotal() > 0
check but decoupling those uses helps clarify (I hope) who is doing what and why. Sometimes for performance reasons we collapse those use cases but that normally makes the code harder to understand.
walker.go
Outdated
// This method doesn't change the `ActiveNode`, it just changes where | ||
// is it pointing to next, it could be interpreted as "turn to the next | ||
// child". | ||
func (w *Walker) Next() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming this to NextChild()
or SkipChild()
might be clearer — if I’m calling this from inside an Iterate()
visitor, “next” might sound more like it’s just getting the next node that would be iterated.
|
||
// NavigableNode is the interface the nodes of a DAG need to implement in | ||
// order to be traversed by the `Walker`. | ||
type NavigableNode interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this lets you move a lot of preloading logic out of the walker, which is nice, but is it possible you are optimizing for a more abstract use case that we might not need, or might not need for a long time? Adding this interface makes usage of the Walker
more complicated for any normal IPLD code:
// Box my node in a NavigableNode
boxedRoot = NewNavigableIPLDNode(root, dagService)
walker = NewWalker(ctx, boxedRoot)
walker.iterate(func(boxedNode NavigableNode) error {
// Unbox my node so I can use it
node = ExtractIPLDNode(boxedNode)
// Do something
})
// vs.
walker = NewWalker(ctx, root)
walker.iterate(func(node Node) error {
// Do something
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but is it possible you are optimizing for a more abstract use case that we might not need, or might not need for a long time?
The idea of the walker is to use it (or evaluate using it) not only in the IPLD node case, this can operate on Gx dependencies, MFS directories, HAMT shards, etc.
Adding this interface makes usage of the
Walker
more complicated for any normal IPLD code:
Yes, since Node
is already an interface I don't see a cleaner way to do this (maybe @Stebalien can help here). For other entities represented with concrete structures, like a Gx Dependecy
, we need to only add the NavigableNode
methods to it and we will transparently be able to use it in the Visitor
functions without any weird conversion (just type-casting it to Dependecy
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of the walker is to use it (or evaluate using it) not only in the IPLD node case, this can operate on Gx dependencies, MFS directories, HAMT shards, etc.
Right, that was clear in your description of the PR. I guess what I was trying to ask was: are those actual, real use cases you need to solve here, or just theoretical ones you envision because they feel similar?
The answer might be yes (they’re real)! But I’ve also seen plenty of code that is over-complicated because it was trying to solve a set of theoretical use cases that nobody actually needed it to and that it was never put into practice for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We definitely need a more general description of what a node is (not an IPLD node), hence this NavigableNode
interface, not only for the walker but for other async fetch operations (now being implemented in the HAMT). But no, this hasn't been tested anywhere else (I need to get this passed first), so it's something to keep in mind. Over-complicating this would make it harder to use elsewhere, when I created the NavigableNode
I started adding too many methods that I ended up removing (or commenting) to take it now to this reduced 2 method version: tell me how many children you have and fetch the children X.
walker.go
Outdated
// of recursive ones) that expose the `path` of nodes from the root to | ||
// the `ActiveNode` it currently points to. | ||
// | ||
// It provides a `Visitor` function to perform user specified logic on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I don’t think wording quite makes sense. It’s the user of the Walker that provides the visitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the user should provide the Visitor
implementation, i.e., the logic of what to do with the nodes we're visiting. I meant that the Walker
provides the Visitor
definition itself to allow the user that kind of flexibility. But you're right, that is confusing, how would you word it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
It provides multiple ways to walk through the graph (e.g.
Iterate
andSeek
). When using them, you provide aVisitor
function that will be called for each node the Walker traverses. The Visitor can read data from those nodes and, optionally, direct the movement of the Walker by callingPause()
(to stop traversing and return) orNext()
(to skip a child and its descendants).[maybe put a quick example here]
[I might skip discussion of
up
anddown
here and describe them elsewhere since those are really implementation details.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[I might skip discussion of
up
anddown
here and describe them elsewhere since those are really implementation details.]
Where? We had similar issues when working on the balanced builder. It's true that this is too much for a user who just wants to walk a DAG without bothering how it's done, but also these are general implementation characteristics that shouldn't be buried down in a particular method.
Which could be an intermediate place? A loose comment paragraph after the Walker
struct definition (so it doesn't appear in the GoDoc)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three options in my mind:
-
Do you need to describe them collectively at all, or are their own function-level doc comments enough? It feels like the implementations of
Iterate
andSeek
are very clear and readable, so reviewing how they usedown
andup
in narrative text instead of code might not be that big a deal. -
A separate “guide to the implementation” comment at the top of the file, separate from the struct definition.
-
Your suggestion — a loose comment paragraph after the struct definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with number 3, but I also like option 2, I'll keep it in mind for the future.
// Can't keep going down from this node, either at a leaf node | ||
// or the `Visitor` has moved the child index past the | ||
// available index (probably because none indicated that the | ||
// target node could be down from there). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of noted this in the comment above, but it seems like this should be a new type of public error if we aren’t on a leaf, right? In that case, we failed to actually find what we were looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think I get what you're saying, Seek
is returning nil
for two cases, if it was paused (potentially because the user found what it was looking for) but also if we've reached the end. The user would normally be able discriminate what situation it's in since only the user can request a pause (the walker internally never pauses), but would you prefer to return a specific ErrNotFound
in the second case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s reasonable that explicit pausing and pausing because we reached the end are the same. (It looks to me like reaching the end here ought to be the normal situation, right? So returning an error would feel odd to me.)
I was thinking more of this case:
err := walker.Seek(func(node NavigableNode) error {
// Say this node only has one child, but our logic naively moved
// past it because the offsets provided in the node were bad or
// because we just wrote poor code
walker.NextChild()
return nil
})
err == nil
The visitor would have never seen a leaf node, but nothing here makes it obvious something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to check NextChild
for errors, that would be a case where something went wrong.
The visitor would have never seen a leaf node, but nothing here makes it obvious something went wrong.
The visitor would have seen (at least) the root node, which in your example is a leaf node (empty DAGs are not allowed) so I wouldn't consider it wrong, it would be like searching an array of only one value (a sad search, yes, but I wouldn't see it as an error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err := walker.Seek(func(node NavigableNode) error { // Say this node only has one child, but our logic naively moved // past it because the offsets provided in the node were bad or // because we just wrote poor code walker.NextChild() return nil }) err == nil
OT: Thanks for providing this kind of detailed examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The visitor would have seen (at least) the root node, which in your example is a leaf node (empty DAGs are not allowed)
I don’t think you’d have to have an empty DAG to hit that example, though. Wouldn’t this DAG fit the example?
root
|
+---------+----------+
| |
node 2 node 4
| |
| +----+----+
| | |
node 3 node 5 node 6
That said, I just realized I misread something: I thought NextChild()
returned an ErrDownNoChild
, which would get swallowed by Seek()
, so this wouldn’t do any good:
err := walker.Seek(func(node NavigableNode) error {
// Say this node only has one child, but our logic naively moved
// past it because the offsets provided in the node were bad or
// because we just wrote poor code
return walker.NextChild()
})
err == nil
but it returns ErrNextNoChild
, so it’s fine. I guess this isn’t as big a problem, then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it returns
ErrNextNoChild
, so it’s fine.
Yes, exactly. Reaching the end would mean that the first call to NextChild
would fail (as it would normally fail when you skipped all the existing children of the current node). But that shouldn't impact the user's code (you could add a special check for a leaf node so you wouldn't bother to call NextChild
even once, but that wouldn't improve the performance much).
Thanks for reviewing this (again) @Mr0grog !! |
@Stebalien This has low priority but do know that your review is needed to move ahead with this (and apply it in the DAG reader, ipfs/go-unixfs#12). |
With this second round of review I'm pretty satisfied about the documentation and general design (thanks to @Mr0grog) so I'll tag this as blocked pending @Stebalien's review. |
So as not to block this indefinitely, I'm assigning @warpfork (who will be working on the IPLD APIs anyways). I'll trust both of your judgments, merge when you agree on something. |
/cc @warpfork |
walker.go
Outdated
@@ -0,0 +1,439 @@ | |||
package format | |||
|
|||
// TODO: This can be extracted from the `go-ipld-format` repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a slightly more expectant todo than I'd personally like to have in tree :) Maybe it can; probably it can; but we've got so many repos already that a todo suggesting yet another one gives me a bit of a willies.
I'd be happy with this being merged: LGTM 👍 I'd also +1 some of @Mr0grog 's comments about being cautious of speculative generality, though. Long run, I'd rather make Really Good traversables for IPLD nodes; and come up with more smooth interfaces to mapping other kinds of data into our existing node interfaces, rather than try to maintain a hypergenericized traversables library with a whole set of parallel node interfaces (in golang, nonetheless, where anything "generic" is already... shall we say "a big ask"). Introducing the interface in this PR makes sense here and now, and I'm 👍 to this code, but I'd be 👎 to trying to generalize it further or extract it into another repo and give it further generic-beyond-ipld life; another repo and another drifting set of interfaces with separate git history would heap on more overhead when trying to keep things developing in the same direction in the future. |
@warpfork So, actually the title is misleading since I haven't updated it to reflect the current code, this is actually a DAG walker, and by DAG I just mean a graph of any kind of node, not just IPLD nodes. I've seen over and over again code repeating itself just do a basic DAG iteration (where a DAG is the known DAG of IPLD nodes, but also a filesystem hierarchy, a Gx dependency graph, etc). The actual IPLD part is encapsulated in the That said, if it helps move things forward let's leave all the code here, and evaluate separating the |
Add a `Walker` structure with a simple interface to iterate and search inside DAGs, where a DAG need not be DAG of IPLD nodes but any graph of nodes meeting the `NavigableNode` interface. Add a `NavigableIPLDNode` structure (encapsulating the IPLD `Node` interface) that implements the `NavigableNode` interface that allows to traverse a DAG of IPLD `Node`s with support for node promises. License: MIT Signed-off-by: Lucas Molas <[email protected]>
Node
s
@warpfork Removed the |
... I actually thought I hit the approve button yesterday, my bad :) |
Great, thanks for taking the time to review and unblock this! |
Woohoo excited to check back in and see this was merged! Yay @schomatis ! |
Yes. Thanks @schomatis for your patience on this and thanks @warpfork for reviewing! |
Imported from ipfs/kubo#5257. Used in ipfs/go-unixfs#12.
This PR creates a DAG walker to abstract the DAG traversal logic from other parts of the code like the DAG reader. The walker itself started very coupled with the IPLD node logic but now works on a more generic
NavigableNode
interface that would permit to operate on a more general concept of node (e.g., a Gx dependency), so thewalker.go
file can now be extracted elsewhere (where? new repo?).