Performance of RadixTree#getValuesForKeysStartingWith() #18

knutwannheden · 2016-09-22T10:59:25Z

Looking at the implementation of ConcurrentRadixTree#getDescendantValues() (as used by #getValuesForKeysStartingWith()) I get the impression that the performance could be improved by not using lazyTraverseDescendants(), as it creates all these intermediate NodeKeyPair objects, while here only the Node objects are relevant.

Maybe it would even be possible / make sense to support a visitor pattern. Then it should also be possible to skip the "stack" which now has to be maintained by the Iterator. For my use case that would also work very nicely, since I in the end want to process all returned values.

knutwannheden · 2016-09-22T11:42:03Z

FWIW, I did some very basic testing with a custom RadixTree#visitNodesStartingWith(String, NodeVisitor) method. NodeVisitor looks something like this:

interface NodeVisitor {
  boolean visit(Node node);
}

The return value is just there to allow the client to prune / abort. I suppose there should really be a second CharSequence key parameter as with KeyValuePair, but I didn't need that for my use case.

For large trees (I am testing with a tree with about 900K values) it looks like the performance is quite a bit better.

Would it be possible to include something along these lines? Would it require splitting the Node interface just so that the visitor doesn't have access to updateOutgoingEdge(Node) (at least not without some casting)?

npgall · 2016-09-22T23:20:44Z

Hi Knut,

Yes that sounds reasonable, and I'd be open to adding something like this.

It might indeed as you say require splitting the Node interface. The visitor must have read-only access.

I don't have much time to spend investigating or working on this myself at the moment. Is this something you could implement yourself, with a view to submitting a pull request? From what you said above, your approach and ideas make sense AFAICS.

So basically, if you could follow the code style/conventions in the project, and also maintain full test coverage, then I'd be happy to accept a pull request to add this feature. Of course I'd be available and more than happy to help out and answer any questions if you want to take this on!

Niall

knutwannheden · 2016-09-23T08:36:43Z

Hi Niall,

Thanks for the feedback. Yes, I would be willing to supply a corresponding PR. I will have to try to fit it in somehow...

Here some questions:

Compared to the rest of the RadixTree API this new visitNodesStartingWith() looks a bit alien, because it doesn't return an Iterable which computes the result lazily. Looking at how you use RadixTree in cqengine's RadixTreeIndex it doesn't look like you would be able to "benefit" from this new method there. I wouldn't want to pollute the API just for my own purposes. So I could alternatively just subclass ConcurrentRadixTree in my own code and add my method there. This would unfortunately require making ConcurrentRadixTree#searchTree() protected and also ConcurrentRadixTree.SearchResult and ConcurrentRadixTree.Classification and some of their members public. That is probably also not what you want...
visitNodesStartingWith() is the only method I need. But for completeness there should probably be equivalent methods for all other queries returning Iterables. And probably then also for all other tree implementations...

Thinking ahead I already know of one more method I would like to have. Something like an update(CharSequence key, Function<O, O> updateFunction) which would be similar to put(CharSequence key, O value) but where the value is computed by updateFunction which as a parameter receives the old value (or null if none). As you will be able to guess I need to squeeze out as much performance as possible.

So, if you don't see how you could benefit from these extensions anywhere in your projects, I would suggest to instead make ConcurrentRadixTree extendable as described above. If you don't like that idea so much, I would also be OK with closing this issue. I could then just "fork" your implementation.

knutwannheden · 2016-09-23T08:39:45Z

In any case I could supply a PR which optimizes ConcurrentRadixTree#getDescendantValues() to not use lazyTraverseDescendants() as described in my initial comment. The gains are probably rather small, but better than nothing.

npgall · 2016-09-23T11:34:19Z

Hi Knut,

Well I guess if you add a generic tree.visit(Visitor) method, then the actual algorithm to traverse the tree will be encapsulated in the Visitor object which is provided by the application. Does that make sense? So you would write a VisitNodesStartingWithVisitor in your application, and then supply this to the tree.visit() method.

If this is the case, then we would not need to add many API methods to the trees themselves at all, even to support esoteric traversal algorithms. Applications would be free to define their own Visitor objects, external to this library.

CQEngine isn't the only user of concurrent-trees. For example recently Cassandra database has been using these trees in SASI indexes. So if by adding generic support for the visitor pattern we would allow improved performance for power users without complicating things for regular users then I think this would be worthwhile.

knutwannheden · 2016-09-23T12:13:33Z

Hi Niall,

In my prototype I had the actual traversal algorithm in the tree. That may be a bit unusual for the visitor pattern, but I have seen so many different implementations of that pattern, so I am not even sure how unusual it actually is.

But what you suggest sounds like a good idea, as it doesn't pollute your API. I will look into it.

If it can be decoupled that way, it may even make sense to supply some standard visitors as part of concurrent-trees. Or maybe at least the tree implementation classes could provide an abstract visitor base class, so the visitors don't necessarily need to duplicate standard traversal logic as in ConcurrentRadixTree#searchTree().

Any thoughts or reservations about an update(CharSequence key, Function<O, O> updateFunction) method?

npgall · 2016-09-23T12:21:51Z

Yes sure, we could look at including some abstract or general purpose visitors as part of concurrent-trees.

Sorry I forgot to say - yes the idea for the update() method seems reasonable too!

knutwannheden · 2016-09-23T12:24:15Z

Great!

BTW, what's the minimum Java version supported by concurrent-trees? I am asking because I wonder if I can add a default accept(NodeVisitor visitor) method to Node or not.

npgall · 2016-09-23T12:53:36Z

It targets Java 6, so we can't use default methods I'm afraid :(

I use Java 8 personally, but I'd like to keep Java 6 compatibility for
other users who might need it.

Sent from my Android

On Sep 23, 2016 13:24, "Knut Wannheden" [email protected] wrote:

Great!

BTW, what's the minimum Java version supported by concurrent-trees? I am
asking because I wonder if I can add a default accept(NodeVisitor visitor)
method to Node or not.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#18 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACuJigV4CcWznQliVOOGI7F0Jdj-gDibks5qs8TvgaJpZM4KDw5u
.

knutwannheden · 2016-09-23T13:30:41Z

OK. Thanks.

RadixTree#searchForKeysStartingWith() replaces all of the three methods RadixTree#getKeysStartingWith(), #getValuesForKeysStartingWith(), and #getKeyValuePairsForKeysStartingWith(). The new method returns a TreeSearchResult object which declares the methods asKeyIterable(), asValueIterable(), and asKeyValuePairIterable(). So instead of calling tree.getKeysStartingWith("foo") a client would now call tree.searchForKeysStartingWith("foo").asKeyIterable(). The above refactoring allows for TreeSearchResult to be extended with other kinds of traversal methods. E.g. some kind of accept(Visitor) or a collectInto(Collection).

knutwannheden · 2016-09-27T09:24:06Z

I found a solution which avoids some of the problems I saw with a generic accept(Visitor) method: knutwannheden@7250696.

As you can see that solution currently breaks the API. That would not actually be necessary, I just wanted to make the intention of the refactoring as clear as possible and get your feedback on that. But I am thinking it may make sense to go in this direction and deprecate (and in the future derelease) some parts of the API.

This design would then make it simple to support other query result traversals like visitors or collectors.

knutwannheden · 2016-10-04T20:29:07Z

Hi Niall,

Did you find some time to give my proposal some thought? A slight variation would be to add some kind of InternalRadixTree interface, which every implementation of IRadixTree must implement. This interface could then declare the new methods and the existing interface would remain as-is.

The thing I don't like about the generic accept(Visitor) method is that I don't see how I can then avoid duplicating at least some of the logic in the current getXxxStartingWith() methods, which already contain some duplication. That is how I then came up with the proposed solution.

Knut

npgall · 2016-10-04T23:00:23Z

Hi Knut,

Sorry I've been away for a few days, but I got some time just now to look into this in more detail.

It's not really an option to change the existing APIs unless there would be a significant improvement. This library has become quite widely used so I'm cautious about making changes which would not be backward compatible, especially if they would not improve performance across the board.

In this case, the proposal would improve performance for the getValuesForXXX() methods, but it would not improve the performance of and it would introduce the creation of additional objects into the getKeysXXX() and getKeyValuePairsXXX() methods.

Your idea to restructure the APIs looks good, and ordinarily would totally make sense! But just in the case of this library I'm especially cautious about making significant changes internally because the existing code has been battle-tested in production for a number of years, and it's already in use in Cassandra and a number of apps.

But going back to your original question - your concern was about object creation in the ConcurrentRadixTree#getDescendantValues() method? Can we just refactor that method to improve performance in a targeted way without impacting the other methods?

I was able to eliminate the object creation in getDescendantValues() based on your excellent points in this branch. It seems quite straightforward and it's an isolated/targeted performance improvement. Would this be enough to solve the problem?

While it eliminates creation of the NodeKeyPair objects, it still uses a stack, but we need to use some kind of a stack to traverse the tree anyway right?

I'm still open to adding a visitor method if you'd like also.

knutwannheden · 2016-10-05T02:29:44Z

Hi Niall,

Thanks for getting back to me and thanks for the small performance improvement. I was planning to to create a separate PR for that. I have a few other small performance improvements I would like to suggest. I will create separate issues or PRs for those.

The improvement to getDescendantValues() is not quite enough for my performance target. I would need the visitor approach, in which there is no state being maintained (as it simply traverses all nodes recursively). That reminds my that I read in the documentation that you didn't want to use recursion because of possible stack overflows. I suppose that this would be a risk the client of the visitor method would have to accept.

Regarding the visitor method: What about adding the alternative methods I suggest (returning the search result object) without deleting the current methods? To not "pollute" the IRadixTree interface these methods could be declared in a "hidden / advanced" InternalRadixTree interface, to which the client would first have to cast to use them. This should then be backwards compatible. The existing methods could then be implemented in terms of these or not. It is just a single extra object being created and I don't see any room for a regression here. Or the methods could be added to IRadixTree and the current methods could be deprecated and scheduled for removal in a future major release.

With a generic visitor method there are still some methods like searchTree() which would have to be made public. Or at least the SearchResult class. Otherwise a custom visitor would also have to duplicate that logic, which wouldn't be so nice. Would you see that as a possibility?

Knut

npgall · 2016-10-05T07:52:51Z

Hi Knut,

Sounds good!

Yes the searchTree method could be made public. Or rather, I guess it would be a method similar to it (which perhaps wraps it) but which returns read-only views of Node objects as we discussed earlier?

Best regards,
Niall

knutwannheden · 2016-10-05T13:52:35Z

I wonder about the read-only Node objects. ConcurrentRadixTree already implements PrettyPrintable, which declares a getNode() method. True, this is somewhat of a hidden implementation detail, but it allows any client to traverse the entire tree and call updateOutgoingEdge() anywhere. Of course the PrettyPrinter would be an excellent candidate for the visitor method. So should we then also change ConcurrentRadixTree to no longer implement PrettyPrintable? Technically that would of course be a backward incompatibility, so we could instead change PrettyPrintable#getNode() to return a ReadOnlyNode (or whatever we decide to call that interface).

Sorry if I keep going on about this, but do I understand you correctly that you don't like the idea of adding an additional searchForKeysStartingWith() method to IRadixTree, even if we do that in a backwards compatible way (i.e. keep all existing methods)?

npgall · 2016-10-07T16:01:12Z

Hi Knut,

Yes the public interface RadixTree does not extend PrettyPrintable, so this is not part of the public API (intentionally). PrettyPrintable is indeed only an internal implementation detail. It could be changed or reimplemented as a visitor. Nobody should be depending on the current implementation. That said, I don't know if it will be worth changing it, as there's not a major benefit per-se - but it would indeed suit the visitor pattern quite well.

We should decide on the name of the read-only node interface. If we call it ReadOnlyNode, then it may be counterintuitive that a modifiable Node interface would extend it. What do you think about VisitableNode?

Also what naming conventions are you thinking regarding the visitor methods? -

interface RadixTree {
    ...
    void acceptVisitor(Visitor visitor);
}
...
interface Visitor {
    boolean accept(VisitableNode node);
}

Regarding the addition of searchForKeysStartingWith() etc. This method would just exist to allow a visitor to search for nodes starting with the current key, right? In other words, to make the existing algorithms accessible to a visitor?

If so then I don't think we should use this approach, because the benefit of the Visitor pattern should be that the traversal algorithms are decoupled from the tree. And then, that any kind of algorithms could be used to traverse the tree without a need for supporting methods in the public API.

So if you'd like to access methods such as searchTree() from visitors, another option would be to make those public static methods, and move them into a TraversalAlgorithms class? Would this work?

knutwannheden · 2016-10-10T21:04:08Z

The idea behind searchForKeysStartingWith() would be to offer that as an alternative to the existing three get*KeysStartingWith() methods. It would return some kind of search result, as in my Git branch, with e.g. an additional forEach(Consumer) method (similar as in Java 8). But that only makes sense if the existing three methods would eventually be phased out (i.e. mark them as deprecated now) or if the new searchForKeysStartingWith() method would be added to some hidden API for "power users" (e.g. an additional interface InternalRadixTree implemented by ConcurrentRadixTree). Of course this isn't really the visitor pattern, but a forEach(Consumer) would certainly fit my bill very nicely.

If a client needs to control the tree traversal, then I think it is probably enough to rely on PrettyPrintable returning the root and then call the Node#getOutgoingEdges() method recursively on the returned Nodes. Using Guava's TreeTraverser you wouldn't even have to implement the traversal algorithm. But of course a searchTree() utility method would still come in handy to find the starting node for a given key.

npgall · 2016-10-12T22:08:42Z

There is already a forEach(Consumer) method, inherited from java.lang.Iterable, and it's available when the library is run on Java 8.

As I mentioned, the existing methods are widely used, so I have no intention of breaking backward compatibility unless doing so will introduce some performance benefit or other improvement. So if this new visitor feature will be implemented, it must be done in a backward compatible way, with minimal disruption and changes to the library.

I don't think there would be any problem in moving the existing searchTree method into a separate TreeTraversals class though. As that way, that method will likely be inlined by HotSpot anyway, giving identical performance without API changes.

knutwannheden · 2016-10-13T09:10:25Z

Yes, with Java 8 the provided forEach() exists. But it would of course be backed by the stateful Iterator, which is exactly what I want to avoid 😃. I understand your requirement regarding compatibility and I think I now also understand that introducing a new API and in a future release deprecating (and later maybe even removing) the current API isn't an option either. Apologies for the back and forth, but that last bit wasn't clear to me before.

I will try to work out a change with a TreeTraversals class. This could then provide something like the forEach() or a visitor doing a recursive (and otherwise stateless) tree traversal.

knutwannheden · 2017-01-19T05:48:35Z

Would you for the 3.0 release consider changing the RadixTree API to include a searchForKeysStartingWith() method as in my Git commit?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance of RadixTree#getValuesForKeysStartingWith() #18

Performance of RadixTree#getValuesForKeysStartingWith() #18

knutwannheden commented Sep 22, 2016

knutwannheden commented Sep 22, 2016

npgall commented Sep 22, 2016

knutwannheden commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

npgall commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

npgall commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

npgall commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

knutwannheden commented Sep 27, 2016

knutwannheden commented Oct 4, 2016

npgall commented Oct 4, 2016

knutwannheden commented Oct 5, 2016

npgall commented Oct 5, 2016

knutwannheden commented Oct 5, 2016

npgall commented Oct 7, 2016

knutwannheden commented Oct 10, 2016

npgall commented Oct 12, 2016

knutwannheden commented Oct 13, 2016

knutwannheden commented Jan 19, 2017

Performance of RadixTree#getValuesForKeysStartingWith() #18

Performance of RadixTree#getValuesForKeysStartingWith() #18

Comments

knutwannheden commented Sep 22, 2016

knutwannheden commented Sep 22, 2016

npgall commented Sep 22, 2016

knutwannheden commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

npgall commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

npgall commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

npgall commented Sep 23, 2016

knutwannheden commented Sep 23, 2016

knutwannheden commented Sep 27, 2016

knutwannheden commented Oct 4, 2016

npgall commented Oct 4, 2016

knutwannheden commented Oct 5, 2016

npgall commented Oct 5, 2016

knutwannheden commented Oct 5, 2016

npgall commented Oct 7, 2016

knutwannheden commented Oct 10, 2016

npgall commented Oct 12, 2016

knutwannheden commented Oct 13, 2016

knutwannheden commented Jan 19, 2017