Skip to content
This repository was archived by the owner on May 9, 2024. It is now read-only.

Introduce simplified string dictionary #615

Merged
merged 5 commits into from
Aug 15, 2023
Merged

Conversation

alexbaden
Copy link
Contributor

This PR introduces a drop-in replacement for the StringDictionary class using STL containers under the namespace fast. The previous string dictionary implementation remains under the namespace legacy. The "fast" dictionary has about the same performance of the previous dictionary, but includes options like materialized hashes on by default (materialized hashes give a small performance penalty for small dictionaries in the legacy implementation, because the cost of materializing and storing the hash outweighs the benefit of skipping string comparisons when there are lots of collisions in the hash table). Materialized hashing is essentially free in the fast implementation because we store the hashes directly next to the string ID in the hash table payload, and both are 4 byte values.

Currently fast dictionary passes all StringDict tests except for those involved in dictionary translation. I intend to push the performance boundary of the fast dictionary first, then enable all tests, and finally remove the legacy dictionary and add APIs which should give more performance - e.g. where string ownership remains outside of the dictionary. Also, this PR depends on #613 and #614 which should be merged first.

@alexbaden alexbaden force-pushed the alex/fast_string_dict branch 2 times, most recently from c8db2df to 80488ac Compare August 8, 2023 14:46
@alexbaden alexbaden marked this pull request as ready for review August 8, 2023 14:46
@alexbaden
Copy link
Contributor Author

This is now ready for review - performance is on par with the legacy string dictionary, but materialized hashing is enabled always, so performance should be better on bigger scale or more dense datasets (as compared to taxi).

Copy link
Contributor

@ienkovich ienkovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! I have a couple of minor threading concerns though.


// compute hashes
auto hashes = std::make_unique<uint32_t[]>(string_vec.size());
tbb::parallel_for(tbb::blocked_range<size_t>(0, string_vec.size()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy version used a grain size to avoid very small tasks. I think it's reasonable to put some limits here. We don't want to create a task per each string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tips! I experimented with a few different options following this guide and something in the range of 25k seemed to be the best, though not by very much. It's possible taxi is too noisy and we need a dedicated microbenchmark to tune this. I think 25k should be fine for now, though.

@alexbaden alexbaden force-pushed the alex/fast_string_dict branch from 80488ac to c407f86 Compare August 14, 2023 23:03
@alexbaden alexbaden merged commit f9b4c07 into main Aug 15, 2023
@alexbaden alexbaden deleted the alex/fast_string_dict branch August 15, 2023 04:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants