Introduce simplified string dictionary #615

alexbaden · 2023-08-03T15:18:33Z

This PR introduces a drop-in replacement for the StringDictionary class, built on STL containers, under the namespace fast. The previous string dictionary implementation remains under the namespace legacy. The "fast" dictionary has about the same performance as the previous dictionary, but enables options such as materialized hashes by default. In the legacy implementation, materialized hashes carry a small performance penalty for small dictionaries, because the cost of materializing and storing the hash outweighs the benefit of skipping string comparisons, which matters mainly when there are lots of collisions in the hash table. Materialized hashing is essentially free in the fast implementation because we store the hash directly next to the string ID in the hash table payload, and both are 4-byte values.
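To illustrate the materialized-hash idea described above, here is a minimal sketch, not the code from this PR. It assumes an open-addressing table with linear probing and no resizing; `Slot`, `TinyDict`, `getOrAdd` and `getString` are hypothetical names chosen for the example, and `std::hash` stands in for whatever hash function the dictionary actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical payload: the 32-bit hash sits right next to the 32-bit ID,
// so keeping the hash costs no extra lookup or indirection.
struct Slot {
  uint32_t hash = 0;  // materialized hash of the stored string
  int32_t id = -1;    // string ID; -1 marks an empty slot
};
static_assert(sizeof(Slot) == 8, "hash and ID pack into 8 bytes");

// Hypothetical dictionary sketch: fixed capacity, linear probing.
class TinyDict {
 public:
  explicit TinyDict(std::size_t capacity_pow2 = 1024) : slots_(capacity_pow2) {
    // Capacity must be a power of two so that masking works as modulo.
    assert(capacity_pow2 > 0 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
  }

  // Returns the ID of `s`, assigning the next free ID if `s` is new.
  int32_t getOrAdd(std::string_view s) {
    const uint32_t h = static_cast<uint32_t>(std::hash<std::string_view>{}(s));
    const std::size_t mask = slots_.size() - 1;
    std::size_t pos = h & mask;
    // At least one slot is always empty, so probing terminates.
    while (slots_[pos].id >= 0) {
      // Compare the 4-byte hashes first; read string bytes only on a match.
      if (slots_[pos].hash == h &&
          std::string_view(strings_[slots_[pos].id]) == s) {
        return slots_[pos].id;
      }
      pos = (pos + 1) & mask;
    }
    if (strings_.size() + 1 >= slots_.size()) {
      throw std::length_error("sketch does not resize the table");
    }
    const int32_t id = static_cast<int32_t>(strings_.size());
    strings_.emplace_back(s);
    slots_[pos] = Slot{h, id};
    return id;
  }

  const std::string& getString(int32_t id) const { return strings_.at(id); }
  std::size_t size() const { return strings_.size(); }

 private:
  std::vector<Slot> slots_;           // hash table payload
  std::vector<std::string> strings_;  // ID -> string
};
```

In this sketch, adding "foo", then "bar", then "foo" again yields the IDs 0, 1 and 0. On a probe collision the stored hash usually differs, so the string comparison is skipped entirely.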

Currently the fast dictionary passes all StringDict tests except for those involving dictionary translation. I intend to push the performance boundary of the fast dictionary first, then enable all tests, and finally remove the legacy dictionary and add APIs that should give more performance - e.g. where string ownership remains outside of the dictionary. Also, this PR depends on #613 and #614, which should be merged first.

alexbaden · 2023-08-08T14:47:39Z

This is now ready for review - performance is on par with the legacy string dictionary, but materialized hashing is always enabled, so performance should be better on datasets that are larger or denser than taxi.

ienkovich

Overall looks good! I have a couple of minor threading concerns though.

omniscidb/StringDictionary/StringDictionary.cpp

ienkovich · 2023-08-14T18:35:28Z

+
+  // compute hashes
+  auto hashes = std::make_unique<uint32_t[]>(string_vec.size());
+  tbb::parallel_for(tbb::blocked_range<size_t>(0, string_vec.size()),


The legacy version used a grain size to avoid very small tasks. I think it's reasonable to put some limits here. We don't want to create a task for each string.

Thanks for the tips! I experimented with a few different options following this guide and something in the range of 25k seemed to be the best, though not by very much. It's possible taxi is too noisy and we need a dedicated microbenchmark to tune this. I think 25k should be fine for now, though.
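For readers unfamiliar with grain sizes, the sketch below shows the effect being discussed. It is not the code from this PR. In TBB the grain size is the optional third argument of the `tbb::blocked_range` constructor. To stay self-contained, the sketch swaps TBB for `std::async` and splits the range into fixed chunks by hand, which is simpler than TBB's recursive splitting. `hashAll` and `chunkCount` are hypothetical names, `std::hash` is a stand-in hash function, and the default of 25000 is taken from the comment above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Number of tasks created for n items: one per chunk of up to grain_size items.
std::size_t chunkCount(std::size_t n, std::size_t grain_size) {
  assert(grain_size > 0);
  return (n + grain_size - 1) / grain_size;
}

// Hypothetical helper: hash every string, one task per chunk rather than one
// task per string. std::async stands in for tbb::parallel_for here.
std::vector<uint32_t> hashAll(const std::vector<std::string>& strings,
                              std::size_t grain_size = 25000) {
  assert(grain_size > 0);
  std::vector<uint32_t> hashes(strings.size());
  std::vector<std::future<void>> tasks;
  tasks.reserve(chunkCount(strings.size(), grain_size));
  for (std::size_t begin = 0; begin < strings.size(); begin += grain_size) {
    const std::size_t end = std::min(begin + grain_size, strings.size());
    // Each task writes a disjoint slice of `hashes`, so no locking is needed.
    tasks.push_back(std::async(std::launch::async, [&strings, &hashes, begin, end] {
      for (std::size_t i = begin; i < end; ++i) {
        hashes[i] = static_cast<uint32_t>(std::hash<std::string>{}(strings[i]));
      }
    }));
  }
  for (auto& task : tasks) {
    task.get();  // wait for completion and rethrow any exception
  }
  return hashes;
}
```

With a grain size of 25000, hashing 100000 strings creates 4 tasks in this sketch instead of 100000, which is the per-task overhead the review comment is concerned about.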

alexbaden requested review from kurapov-peter and ienkovich August 3, 2023 15:18

alexbaden force-pushed the alex/fast_string_dict branch 2 times, most recently from c8db2df to 80488ac Compare August 8, 2023 14:46

alexbaden marked this pull request as ready for review August 8, 2023 14:46

ienkovich approved these changes Aug 14, 2023

alexbaden added 4 commits August 14, 2023 15:03

Remove string dict typedef

0e5ad09

Code assumes 4-byte hashes, so the typedef is useless.

Move string dictionary translation to dedicated class

3ad364c

Init logger in StringDictionaryTest

f32fbfa

Introduce simplified string dictionary

0c446a7

alexbaden force-pushed the alex/fast_string_dict branch from 80488ac to c407f86 Compare August 14, 2023 23:03

Parallelism fixups for simplified string dictionary

c407f86

alexbaden merged commit f9b4c07 into main Aug 15, 2023

alexbaden deleted the alex/fast_string_dict branch August 15, 2023 04:37
