Skip to content

Lazily build hashtable for MapBlock#11791

Merged
yingsu00 merged 1 commit intoprestodb:masterfrom
yingsu00:lazyMapHT
Nov 8, 2018
Merged

Lazily build hashtable for MapBlock#11791
yingsu00 merged 1 commit intoprestodb:masterfrom
yingsu00:lazyMapHT

Conversation

@yingsu00
Copy link
Contributor

@yingsu00 yingsu00 commented Oct 26, 2018

Fix for #11808
Presto builds hashtable for MapBlocks eagerly when constructing the
MapBlock even it's not needed in the query. Building a hashtable could
take up to 30% CPU of the scan cost on a map column. This commit defers
the hashtable build to the time it's needed in SeekKey(). Note that we
only do this to the MapBlock, not the MapBlockBuilder to avoid complex
synchronization problems. The MapBlockBuilder will always build the
hashtable. As the result MergingPageOutput and PartitionOutputOperator
will still rebuild the hashtables when needed. The measurements shows
there will be less than 10% pages for MergingPageOutput to build the
hashtables. We will have a seperate PR to improve PartitionOutput
and avoid rebuilding the pages so as to avoid hashtable rebuilding.

Simple select checsum queries show over 40% CPU gain:

Test                          | After  | Before | Improvement
select 2 map columns checksum | 11.69d | 20.06d | 42%
Select 1 map column checksum  |  9.67d | 17.73d | 45%

Copy link
Contributor

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just skimmed. I didn't intend to review this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add requireNonNull (or explanatory comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please include rationale

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not access this.hashTables directly? you do this when you complete computation anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.hashTables

Copy link
Contributor

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of changes here, many of which seems to be refactorings. Can you pull those into separate commits so that it’s esiser to review and see the real change?

@yingsu00
Copy link
Contributor Author

yingsu00 commented Nov 1, 2018

@electrum Hi David, I have removed the formatting changes (breaking long lines). Now the changes should be all related to the logic change. Please let me know if this is what you want, thanks!

@dain dain self-assigned this Nov 1, 2018
Copy link
Contributor

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments/suggestions from my first read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invert this if condition to remove a level of nesting. Something like this:

if (this.hashTables != null) {
    return this;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, invert this one, and move to right after the start of the synchronized block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the {} should go on the previous line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add clarifying parentheses

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I're prefer to see two separate checks. One to ensure the keys and values are the same position count, and one the verifies the hash table size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to introduce an interface for this class, what would the API look like for that interface? If it is only a small number of methods, we might want to add something like that to keep the abstractions between these classes simpler. I'm not saying we should actually add an interface here, yet; I'm just curious what it would look like if we did.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dain which classes are you considering to implement this interface? Did you just mean SingleMapBlock and AbstractSingleMapBlock?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the AbstractSingleMapBlock argument here. If that were an interface build specifically for this class, what methods would it have? If it is small, we might want to add one to clean up the code and simplify testing.... just a thought

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dain Did you mean the mapBlock we passed in? It's AbstractMapBlock. The referenced methods in SingleMapBlock include getRawKeyBlock(), getRawValueBlock(), getHashTables(), and it also accesses keyNativeHashCode and keyBlockNativeEquals members. Do you think it's worth making a new interface? If yes I'll add getKeyNativeHashCode() and getKeyBlockNativeEquals() and make a 5 method interface in a separate commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is only 5 methods, I would consider adding the interface. This is just my opinion... I generally get a bad feeling anytime I see a method taking a parameter with a type named Abstract*, as to me it screams, we should have an interface here. In this case I would name the interface MapBlockData. Again, just my opinion. You can ask others how they feel about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dain makes sense. Shall I send a new PR for this interface or a seperate commit in this same PR, or just in this commit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of those options are fine with me. Maybe ask @haozhun what he prefers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that it cannot be an interface because those fields are not public.

Copy link
Contributor

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me. @haozhun, did you want to review this?

@dain dain removed their assignment Nov 7, 2018
Copy link
Contributor

@haozhun haozhun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment: write to the field is protected by "this" monitor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that it cannot be an interface because those fields are not public.

Presto builds hashtable for MapBlocks eagerly when constructing the
MapBlock even it's not needed in the query. Building a hashtable could
take up to 40% CPU of the scan cost on a map column. This commit defers
the hashtable build to the time it's needed in SeekKey(). Note that we
only do this to the MapBlock, not the MapBlockBuilder to avoid complex
synchronization problems. The MapBlockBuilder will always build the
hashtable. As the result MergingPageOutput and PartitionOutputOperator
will still rebuild the hashtables when needed. The measurements shows
there will be less than 10% pages for MergingPageOutput to build the
hashtables. We will have a seperate PR to improve PartitionOutput
and avoid rebuilding the pages so as to avoid hashtable rebuilding.

Simple select checsum queries show over 40% CPU gain:
Test                          | After  | Before | Improvement
select 2 map columns checksum | 11.69d | 20.06d | 42%
Select 1 map column checksum  |  9.67d | 17.73d | 45%
@yingsu00 yingsu00 merged commit 23de11f into prestodb:master Nov 8, 2018
yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Jan 8, 2019
This reverts
1) commit ad05dcb.
2) commit 23de11f.

PR prestodb#11791 (commit 23de11f and ad05dcb), which lazily builds the
hashtables for maps, introduced a regression for the case where the
MapBlock is created through AbstractMapBlock.getRegion(). The hashtables
built on the MapBlock region were not updated in the original MapBlock,
thus causing hashtables repeatedly being built on the same base MapBlock.
yingsu00 pushed a commit that referenced this pull request Jan 9, 2019
This reverts
1) commit ad05dcb.
2) commit 23de11f.

PR #11791 (commit 23de11f and ad05dcb), which lazily builds the
hashtables for maps, introduced a regression for the case where the
MapBlock is created through AbstractMapBlock.getRegion(). The hashtables
built on the MapBlock region were not updated in the original MapBlock,
thus causing hashtables repeatedly being built on the same base MapBlock.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants