Sort position links for faster non-equi joins#7097
Conversation
There was a problem hiding this comment.
Why do you need this. It incomplete and unused.
There was a problem hiding this comment.
Why is it incomplete? It is used both in SortedPositionLinks and in LookupSource directly, in case if there is no SortChannel for filter function.
There was a problem hiding this comment.
Add short javadoc what it is.
There was a problem hiding this comment.
please check comment if it suffice
There was a problem hiding this comment.
Either change signature to take "JoinFilterFunction" or allow empty() and set null to the field in that case.
There was a problem hiding this comment.
changed method signature
There was a problem hiding this comment.
don't do that. Extract separate isDescending(ComparisonType) method.
There was a problem hiding this comment.
drop those and just add default: section with return Optional.empty()
There was a problem hiding this comment.
rename to asBuildFieldReference
There was a problem hiding this comment.
Shouldn't you check some conditions about other side which is not buildFieldReference.
It seems like you are making some non-written assumption on shape of expression which is passed to this method.
Those should be exposed as Preconditions.checkArgument calls.
There was a problem hiding this comment.
yes, I have missed one thing. For such simple extractor, I have to check that there is no build field references on the other side of the inequality. Fixed
There was a problem hiding this comment.
unintended change. Reverted
There was a problem hiding this comment.
Please add as a reference implementation.
There was a problem hiding this comment.
This also should be marked as //TODO
There was a problem hiding this comment.
you were able to write compiler code, so it should be easy to write regular code, shouldn't it?
There was a problem hiding this comment.
At a time it wasn't crucial for performance tests and validation whether whole idea has any sense. I have added it now though.
There was a problem hiding this comment.
This cannot be part of state as class is shared between threads.
There was a problem hiding this comment.
Yes, that's one of the reasons why I think decoupling PositionLinks from JoinHash is a way to go. I should have marked this as a //TODO
There was a problem hiding this comment.
You can use multimap. Does it influence that you are using boxed types?
There was a problem hiding this comment.
I can not use Multimap, because I need to have last linked element as a starting point for the position links. Thus on each linking I would have to copy all collection returned by Multimap.get(...)
There was a problem hiding this comment.
You can use computeIfAbsent - https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#computeIfAbsent-K-java.util.function.Function-
There was a problem hiding this comment.
Generally, I think these tests are not enough. It is similar with the current problem related to Exchanges.
The query will pass regardless you other commits are merged or not.
I am sure there are already some tests form inequality join.
There was a problem hiding this comment.
Tests are very WIP.
Btw, there doesn't seems to be tests, that cover those test cases with simple enough comparison expression :)
There was a problem hiding this comment.
I think you can easily add unit tests for that class. That would be nice because it would show expressions are supported.
There was a problem hiding this comment.
why empty? I think it would be nice to test the things here.
There was a problem hiding this comment.
I didn't want to spend a lot of time adding tests without initial approval that this thing will go into code base.
There was a problem hiding this comment.
Is that still valid? I mean there is approval on approach. Right?
There was a problem hiding this comment.
I think is not worth testing, since writing such test would be quite painful.
There are unit tests for both SortedPositionLinks and SortChannelExtractor and integration between JoinHash and PositionLinks interface is tested here regardless of this Optional.empty(). Also there are integration tests in AbstractTestQueries for this .
There was a problem hiding this comment.
you were able to write compiler code, so it should be easy to write regular code, shouldn't it?
There was a problem hiding this comment.
extract this as separate class
caa99ae to
b1581d2
Compare
fe2d411 to
086735a
Compare
|
@dain I have fixed concurrency issues without refactor (as discussed) and it's read for review. |
2fa1cd0 to
1ebf962
Compare
|
@dain I will give it one more read but feel free to go through it at the same time. |
losipiuk
left a comment
There was a problem hiding this comment.
Looks ok up to my understanding of the code.
I have some outstanding questions to @pnowojski to be asked offline.
Though, It should not block your review @dain - at least to extent to see if general approach is something we want to push forward or not.
There was a problem hiding this comment.
Maybe extract 1 variable (e.g MAX_MEMORY_USED). And set the pools as fractions of this one. So if we need to change sth here we just change one thing. Also set max-data-per-node config prop based on that.
There was a problem hiding this comment.
I would rather expect that in the future we will need more customization of those parameters. In other words, we will need passing those values as parameters. Binding them to one static value doesn't seem like a good idea, since they are quite independent.
There was a problem hiding this comment.
don't we need to save the results of query into some field to be sure that JVM does not need some weird code-removal optimizations?
There was a problem hiding this comment.
I have added collecting output into List<Page> to MemoryQueryRunner
There was a problem hiding this comment.
Why not return just new head instead boolean (either left or right)
There was a problem hiding this comment.
How end of list is signalled?
There was a problem hiding this comment.
Commit message suggests that you use Javas LinkedList as an implementation.
Maybe use Implement PositionLinks as link array as commit message and name the class PostionLinksArray?
There was a problem hiding this comment.
Or maybe LinkedPositionLinks will be good and match the SortedPositionLinks from latter commit?
There was a problem hiding this comment.
I have renamed it to ArrayPositionLinks, since it matches with Java's convention of ArrayList vs List.
There was a problem hiding this comment.
InelliJ says Can be replaced with Comparator.comparingLong
There was a problem hiding this comment.
yes, but I don't like it. This seems more easier to understand for me and it matches to reversed comparator in test below.
There was a problem hiding this comment.
Name this testJoinWithLessThanInJoinClause
Inequality is everything which is not equailty. E.g. LIKE, complex expressions etc.
At least it used that way in test name convention so far.
There was a problem hiding this comment.
Is that still valid? I mean there is approval on approach. Right?
131a7e0 to
3533940
Compare
|
Review status:
|
dain
left a comment
There was a problem hiding this comment.
Nice. I had a bunch of minor comments, but it is generally looking good. I noted one major concern about the stateful filter functions, but I think this will be easy to remedy.
This PR was pretty difficult to read because the features were broken into many small commits containing a single file. The is especially the case when a new class is added in one commit and then the test is in the next. It helps to read the code quickly if the commits grouped into features, so I can see how the different bits relate to each other and see usages in code and tests. For this review, I had to flip back and forth between commits. The collection into features (maybe mini-features) helps be focus on a single context which makes reading faster. Here is how I would have grouped the commits, and how I think you should squash them before we merge them:
Fix benchmarks using MemoryLocalQueryRunner
- 3533940 Collect output of the memory benchmarks
- Also apply the "output" fix to the other benchmarks using MemoryLocalQueryRunner
Increase memory limits for benchmarks using MemoryLocalQueryRunner
- a87c482 Bump memory limits for memory benchmarks
Add inequality join benchmarks
Add direct handling of inequality conditions in JoinHash
- 6b538a7 Add PositionLinks interface
- ca87b79 Implement ArrayPositionLinks
- b2e85b6 Implemented SortedPositionLinks with binary search
- 3d33e0b Add unit tests for PositionLinks
- 2ee7111 Add SortExpressionExtractor
- 3f9d0df Add unit tests for SortExpressionExtractor
- 87b119a Add tests queries for sorting position links
- 29b2f67 Extract sort channel from filter function
- f20ac3a Add compare method to PagesHashStrategy
- 4486d5e Use SortedPositionLinks in LookupJoinOperator
There was a problem hiding this comment.
Maybe add an implementation of OutputFactory an inner class, so you don't have to have this crazy 5 argument lambda :) ?
There was a problem hiding this comment.
I think you might be able to use PageConsumerOutputFactory instead of this.
There was a problem hiding this comment.
I still "you might be able to use PageConsumerOutputFactory instead of this".
There was a problem hiding this comment.
I'm am concerned about this change. The reason we have a FilterFunctionFactory is because filter functions are stateful and not thread safe. Each driver (and thus threads) calls get to fetch a private version of the JoinHash for their thread. I am surprised there are no join tests that use a stateful function in the join condition (or maybe the tests are not concurrent enough to see the problem).
There was a problem hiding this comment.
Grrr... :( This makes it more complicated. I can not build positionLinks in this get method, because it would be too costly to sort position links every time get is called.
But on the other hand, I can not return positionLinks created in the constructor with it's private final filterFunction here, because then we hit the same problem you are describing.
The only solution that I can see is to build/sort positionLinks in the constructor with temporarily created filterFunction, but before returning it in in get, this filter function must be overwritten by newly created one.
There was a problem hiding this comment.
Why not use IntComparator from fastutils?
There was a problem hiding this comment.
I think we normally put this call in the @Benchmark method and have the context just expose a getter for the runner.
There was a problem hiding this comment.
It might be easier to read if you named the tables a and b or if you dropped the aliases and rewrite the query to:
SELECT count(*) FROM t1 JOIN t2 using (bucket) WHERE t1.val1 < t2.val2There was a problem hiding this comment.
most of these fields are no longer used
There was a problem hiding this comment.
after fixing thread safety of FilterFunction they are now used again
There was a problem hiding this comment.
This is a utility class (all static methods), so it should be final
There was a problem hiding this comment.
actually it should be extractSortExpression. SortOrder was a remanent after a refactor/rename.
There was a problem hiding this comment.
Can we add a kill switch for this feature incase we find a bug?
pnowojski
left a comment
There was a problem hiding this comment.
I have applied changes, please check my responses to some of your comments.
I have also merged some of those commits, but IMO it's much harder to review one 1000 commit from 1200 lines PR. Indeed, smaller commits are often missing some greater context, but in such case if I have doubts I like to fallback to diff of all commits by whole PR to grasp what PR is about or to search for method usages.
On the other hand smaller commits allows for bundling some changes that belong to some topic. Like in this PR commit "Add compare method to PagesHashStrategy". It would just make unnecessary noise if squashed with other commits.
There was a problem hiding this comment.
fixed, Intellij was auto folding this so I missed it
There was a problem hiding this comment.
I have moved filterFunction to SortedPositionLinks constructor and dropped it from ArrayPositionLinks.
There was a problem hiding this comment.
Thanks, indeed much better to understand.
There was a problem hiding this comment.
I have replaced it and didn't notice performance improvement in benchmarks. However code is almost the same, so fastutil version can stay.
There was a problem hiding this comment.
that's not the only place that uses this assumption. This lines uses it to terminate iteration over position links. Same assumption is used to find a starting point of iteration.
I have added javadoc comment at the top of this class.
There was a problem hiding this comment.
I have added some explanation to the java doc, because I can not come up with an idea for better naming. Keep in mind that smaller and larger are defined using an external comparator, which is a transformation of the filterFunction
There was a problem hiding this comment.
wow, nice bug... It must have slipped through since it was just a poc...
There was a problem hiding this comment.
actually it should be extractSortExpression. SortOrder was a remanent after a refactor/rename.
There was a problem hiding this comment.
They execute faster in H2 then in presto.
There was a problem hiding this comment.
Grrr... :( This makes it more complicated. I can not build positionLinks in this get method, because it would be too costly to sort position links every time get is called.
But on the other hand, I can not return positionLinks created in the constructor with it's private final filterFunction here, because then we hit the same problem you are describing.
The only solution that I can see is to build/sort positionLinks in the constructor with temporarily created filterFunction, but before returning it in in get, this filter function must be overwritten by newly created one.
f29c054 to
35f85e5
Compare
|
@dain I have just approved. Thanks for notification ;) |
|
Applied comments. I have added some more benchmark cases to see if there is a performance drop for very short position links (larger number of buckets) and surprising I haven't seen that. So indeed I think we can turn this on by default. Here are the results. Number of elements in build table is |
Collect output of the memory benchmarks instead of ignoring it
|
The "performance drop" you are seeing is in the margin of error, so it doesn't mean anything. |
|
Maybe I wasn't clear enough. I meant, that I expected to see some performance regression in some edge cases, but it wasn't visible in benchmarks (which is good :) ) |
|
Merged, thanks! |
|
Thanks :) |
This is a implementation of:
#6922
From preliminary benchmarks this change didn't introduce regression in equality joins and for some in-equality joins this gives orders of magnitude speed ups.
BenchmarkInequalityJoinspeeds up from ~210ms down to ~50ms and existingBenchmarkHashBuildAndJoinOperatorsdo not show a visible performance regression:There are couple of TODOs (future work?):
comparemethod may not fit toPagesHashStrategySortedPositionLinksoverPositionLinksListNoopPositionLinks(that returns always-1) if all rows in build table have unique key (this would reduce memory usage and it might slightly improve performance)