@Dandandan
Contributor

@Dandandan Dandandan commented Dec 4, 2020

This PR shows one area for improvement in the hash join and aggregates. Currently the key is hashed twice: once when looking it up, and again when inserting it (or mutating its value). This is particularly expensive with lots of distinct keys (i.e. most joins or aggregates on some high-cardinality identifier).
Using the unstable hash_raw_entry API (or the equivalent API in hashbrown) we can avoid this and get a speedup of roughly 50-100 ms (mostly in the hash join) for TPC-H query 12.

We could also use the hashbrown crate (which the Rust standard library also uses) instead, to avoid needing an unstable feature. Update: did that.
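
To make the double-hashing cost concrete, here is a minimal sketch, not the code from this PR. It swaps in the stable `HashMap::entry` API from std as a stand-in for the raw entry API, since both hash the key once per operation. The names `CountingHasher`, `CountingState`, `lookup_then_insert` and `entry_once` are hypothetical and exist only for this illustration.

```rust
use std::cell::Cell;
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};
use std::rc::Rc;

/// Hypothetical hasher wrapper: counts how many hashes are completed.
struct CountingHasher {
    inner: DefaultHasher,
    finished: Rc<Cell<usize>>,
}

impl Hasher for CountingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.inner.write(bytes);
    }

    fn finish(&self) -> u64 {
        // Every completed hash computation bumps the shared counter.
        self.finished.set(self.finished.get() + 1);
        self.inner.finish()
    }
}

/// Hypothetical BuildHasher that hands the shared counter to each hasher.
struct CountingState {
    finished: Rc<Cell<usize>>,
}

impl BuildHasher for CountingState {
    type Hasher = CountingHasher;

    fn build_hasher(&self) -> CountingHasher {
        CountingHasher {
            inner: DefaultHasher::new(),
            finished: Rc::clone(&self.finished),
        }
    }
}

type CountMap = HashMap<u64, u64, CountingState>;

/// Creates a pre-sized map so that resizing does not add extra hashes.
fn new_map(capacity: usize) -> (CountMap, Rc<Cell<usize>>) {
    let finished = Rc::new(Cell::new(0));
    let state = CountingState { finished: Rc::clone(&finished) };
    (HashMap::with_capacity_and_hasher(capacity, state), finished)
}

/// Pattern described above: look the key up, then insert on a miss.
/// A miss hashes the key for the lookup and again for the insert.
fn lookup_then_insert(keys: &[u64]) -> (CountMap, usize) {
    let (mut map, finished) = new_map(keys.len());
    for &key in keys {
        match map.get_mut(&key) {
            Some(count) => *count += 1,
            None => {
                map.insert(key, 1);
            }
        }
    }
    (map, finished.get())
}

/// Entry-style pattern: one lookup whose result is reused for the insert.
fn entry_once(keys: &[u64]) -> (CountMap, usize) {
    let (mut map, finished) = new_map(keys.len());
    for &key in keys {
        *map.entry(key).or_insert(0) += 1;
    }
    (map, finished.get())
}

fn main() {
    // All keys distinct, i.e. the high-cardinality case.
    let keys: Vec<u64> = (0..1000).collect();
    let (_, twice) = lookup_then_insert(&keys);
    let (_, once) = entry_once(&keys);
    println!("lookup + insert: {} hashes, entry: {} hashes", twice, once);
}
```

The stable entry API needs an owned key up front, whereas the raw entry API can look up by a borrowed key and only build the owned key on a miss, which is presumably part of why it was preferred here.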

This brings the TPC-H query 12 times down from > 1500 ms locally to:

Query 12 iteration 0 took 1399 ms
Query 12 iteration 1 took 1398 ms
Query 12 iteration 2 took 1402 ms
Query 12 iteration 3 took 1407 ms
Query 12 iteration 4 took 1409 ms
Query 12 iteration 5 took 1406 ms
Query 12 iteration 6 took 1414 ms
Query 12 iteration 7 took 1416 ms
Query 12 iteration 8 took 1423 ms
Query 12 iteration 9 took 1430 ms

FYI @jorgecarleitao @andygrove

@Dandandan
Contributor Author

This is ready for review now, @jorgecarleitao @andygrove.

Member

@jorgecarleitao jorgecarleitao left a comment

LGTM. Another cool improvement! :)

@jorgecarleitao
Member

@alamb @andygrove, this introduces a new dependency to DataFusion. Is that OK with you?

@Dandandan
Contributor Author

Some additional context: in the future, when the feature is stabilized, the hashbrown dependency can be dropped again. I think the raw entry API will be useful for future optimizations and hash join algorithms as well. For example, it also lets you supply your own keys instead of ones based on a value.

Contributor

@alamb alamb left a comment

This looks reasonable to me -- thank you @Dandandan
