Lazily build hashtable for MapBlock#11791
Conversation
findepi
left a comment
There was a problem hiding this comment.
i just skimmed. I didn't intend to review this.
There was a problem hiding this comment.
add requireNonNull (or explanatory comment)
There was a problem hiding this comment.
Why not access this.hashTables directly? you do this when you complete computation anyway
electrum
left a comment
There was a problem hiding this comment.
There are lots of changes here, many of which seems to be refactorings. Can you pull those into separate commits so that it’s esiser to review and see the real change?
|
@electrum Hi David, I have removed the formatting changes (breaking long lines). Now the changes should be all related to the logic change. Please let me know if this is what you want, thanks! |
dain
left a comment
There was a problem hiding this comment.
Some minor comments/suggestions from my first read.
There was a problem hiding this comment.
Invert this if condition to remove a level of nesting. Something like this:
if (this.hashTables != null) {
return this;
}There was a problem hiding this comment.
Also, invert this one, and move to right after the start of the synchronized block.
There was a problem hiding this comment.
the {} should go on the previous line
There was a problem hiding this comment.
please add clarifying parentheses
There was a problem hiding this comment.
Actually, I're prefer to see two separate checks. One to ensure the keys and values are the same position count, and one the verifies the hash table size.
There was a problem hiding this comment.
If we were to introduce an interface for this class, what would the API look like for that interface? If it is only a small number of methods, we might want to add something like that to keep the abstractions between these classes simpler. I'm not saying we should actually add an interface here, yet; I'm just curious what it would look like if we did.
There was a problem hiding this comment.
@dain which classes are you considering to implement this interface? Did you just mean SingleMapBlock and AbstractSingleMapBlock?
There was a problem hiding this comment.
I mean the AbstractSingleMapBlock argument here. If that were an interface build specifically for this class, what methods would it have? If it is small, we might want to add one to clean up the code and simplify testing.... just a thought
There was a problem hiding this comment.
@dain Did you mean the mapBlock we passed in? It's AbstractMapBlock. The referenced methods in SingleMapBlock include getRawKeyBlock(), getRawValueBlock(), getHashTables(), and it also accesses keyNativeHashCode and keyBlockNativeEquals members. Do you think it's worth making a new interface? If yes I'll add getKeyNativeHashCode() and getKeyBlockNativeEquals() and make a 5 method interface in a separate commit.
There was a problem hiding this comment.
If it is only 5 methods, I would consider adding the interface. This is just my opinion... I generally get a bad feeling anytime I see a method taking a parameter with a type named Abstract*, as to me it screams, we should have an interface here. In this case I would name the interface MapBlockData. Again, just my opinion. You can ask others how they feel about this.
There was a problem hiding this comment.
@dain makes sense. Shall I send a new PR for this interface or a seperate commit in this same PR, or just in this commit?
There was a problem hiding this comment.
All of those options are fine with me. Maybe ask @haozhun what he prefers.
There was a problem hiding this comment.
The problem here is that it cannot be an interface because those fields are not public.
There was a problem hiding this comment.
Add comment: write to the field is protected by "this" monitor.
There was a problem hiding this comment.
The problem here is that it cannot be an interface because those fields are not public.
Presto builds hashtable for MapBlocks eagerly when constructing the MapBlock even it's not needed in the query. Building a hashtable could take up to 40% CPU of the scan cost on a map column. This commit defers the hashtable build to the time it's needed in SeekKey(). Note that we only do this to the MapBlock, not the MapBlockBuilder to avoid complex synchronization problems. The MapBlockBuilder will always build the hashtable. As the result MergingPageOutput and PartitionOutputOperator will still rebuild the hashtables when needed. The measurements shows there will be less than 10% pages for MergingPageOutput to build the hashtables. We will have a seperate PR to improve PartitionOutput and avoid rebuilding the pages so as to avoid hashtable rebuilding. Simple select checsum queries show over 40% CPU gain: Test | After | Before | Improvement select 2 map columns checksum | 11.69d | 20.06d | 42% Select 1 map column checksum | 9.67d | 17.73d | 45%
This reverts 1) commit ad05dcb. 2) commit 23de11f. PR prestodb#11791 (commit 23de11f and ad05dcb), which lazily builds the hashtables for maps, introduced a regression for the case where the MapBlock is created through AbstractMapBlock.getRegion(). The hashtables built on the MapBlock region were not updated in the original MapBlock, thus causing hashtables repeatedly being built on the same base MapBlock.
This reverts 1) commit ad05dcb. 2) commit 23de11f. PR #11791 (commit 23de11f and ad05dcb), which lazily builds the hashtables for maps, introduced a regression for the case where the MapBlock is created through AbstractMapBlock.getRegion(). The hashtables built on the MapBlock region were not updated in the original MapBlock, thus causing hashtables repeatedly being built on the same base MapBlock.
Fix for #11808
Presto builds hashtable for MapBlocks eagerly when constructing the
MapBlock even it's not needed in the query. Building a hashtable could
take up to 30% CPU of the scan cost on a map column. This commit defers
the hashtable build to the time it's needed in SeekKey(). Note that we
only do this to the MapBlock, not the MapBlockBuilder to avoid complex
synchronization problems. The MapBlockBuilder will always build the
hashtable. As the result MergingPageOutput and PartitionOutputOperator
will still rebuild the hashtables when needed. The measurements shows
there will be less than 10% pages for MergingPageOutput to build the
hashtables. We will have a seperate PR to improve PartitionOutput
and avoid rebuilding the pages so as to avoid hashtable rebuilding.
Simple select checsum queries show over 40% CPU gain: