-
Notifications
You must be signed in to change notification settings - Fork 14
Introduce simplified string dictionary #615
Conversation
c8db2df
to
80488ac
Compare
This is now ready for review - performance is on par with the legacy string dictionary, but materialized hashing is enabled always, so performance should be better on bigger scale or more dense datasets (as compared to taxi). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good! I have a couple of minor threading concerns though.
|
||
// compute hashes | ||
auto hashes = std::make_unique<uint32_t[]>(string_vec.size()); | ||
tbb::parallel_for(tbb::blocked_range<size_t>(0, string_vec.size()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The legacy version used a grain size to avoid very small tasks. I think it's reasonable to put some limits here. We don't want to create a task per each string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tips! I experimented with a few different options following this guide and something in the range of 25k seemed to be the best, though not by very much. It's possible taxi is too noisy and we need a dedicated microbenchmark to tune this. I think 25k should be fine for now, though.
Code assumes 4-byte hashes, so the typedef is useless.
80488ac
to
c407f86
Compare
This PR introduces a drop-in replacement for the
StringDictionary
class using STL containers under the namespacefast
. The previous string dictionary implementation remains under the namespacelegacy
. The "fast" dictionary has about the same performance of the previous dictionary, but includes options like materialized hashes on by default (materialized hashes give a small performance penalty for small dictionaries in the legacy implementation, because the cost of materializing and storing the hash outweighs the benefit of skipping string comparisons when there are lots of collisions in the hash table). Materialized hashing is essentially free in the fast implementation because we store the hashes directly next to the string ID in the hash table payload, and both are 4 byte values.Currently fast dictionary passes all StringDict tests except for those involved in dictionary translation. I intend to push the performance boundary of the fast dictionary first, then enable all tests, and finally remove the legacy dictionary and add APIs which should give more performance - e.g. where string ownership remains outside of the dictionary. Also, this PR depends on #613 and #614 which should be merged first.