small hash #102

rurban · 2015-12-30T23:10:23Z

Optimize hashes with few keys (<= 3-5) into a simple array of keys and values with linear lookup.

HvSMALL(hv) / XHvSMALL(xhv) either checks HvMAX < 7 or tests a flag. If it is a flag, the very first HE* entry needs to be a non-pointer tag (& 0x1).
With inlined HEs and overlong keys we'd need a flag to skip the HvSMALL optimizations for such long keys.
We cannot use the hv_aux-based HvFLAGS with normal HvSMALL hashes, esp. when inlined.

The best would be an inlined len/char*/flags/val array, like the he-array, to be cache-conscious (as in #24, feature/gh24-he-array). The len really should be run-length encoded, and the flags needed for the hash comparison then need to come first.
However, at first we start with simple HE* arrays (an array of pointers, not values).
The last array element needs to hold a NULL sentinel, so we cannot use all 7 HE* slots, only 6.
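
A minimal sketch of the lookup described above, assuming a NULL-terminated array of entry pointers. The `small_he` struct, `SMALL_SLOTS` and `small_fetch()` are hypothetical stand-ins for illustration, not perl's real HE/HEK layout or API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not perl's real HE/HEK layout. */
#define SMALL_SLOTS 7                  /* initial array size */
#define SMALL_MAX   (SMALL_SLOTS - 1)  /* usable slots; the last one stays NULL */

typedef struct small_he {
    const char *key;
    size_t      len;
    void       *val;
} small_he;

/* Linear scan over a NULL-terminated array of entry pointers.
 * Compares the length first, then the bytes; no hash is computed. */
static small_he *
small_fetch(small_he *const *array, const char *key, size_t len)
{
    assert(array != NULL);
    for (; *array != NULL; array++) {
        small_he *he = *array;
        if (he->len == len && memcmp(he->key, key, len) == 0)
            return he;
    }
    return NULL;  /* reached the sentinel: key not present */
}
```

Because the sentinel terminates the scan, the loop needs no separate counter, at the cost of one of the seven slots.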

But there are many more simple hash optims, which we do first.

- extract uncommon magical code from hv_common
- add __builtin_ctz support (count trailing zeros) and use it instead of division on DO_HSPLIT (done with the builtin-ctz branch)
- pre-extend hashes as in aassign with av_extend, when the number of keys is on the stack. This speeds up all the big hash inits (e.g. warnings.pm), needing no costly series of splits during initialization.
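
To illustrate the last two items, here is a rough sketch, assuming power-of-two bucket counts. `div_by_pow2()`, `needs_split()` and `buckets_for()` are hypothetical helpers, and the split rule shown is an assumed load-factor test, not the real DO_HSPLIT macro. Only `__builtin_ctz` is a real GCC/Clang builtin:

```c
#include <assert.h>

/* Divide n by a power-of-two bucket count with a shift instead of a division.
 * __builtin_ctz is a GCC/Clang builtin; its result is undefined for 0. */
static unsigned
div_by_pow2(unsigned n, unsigned buckets)
{
    assert(buckets != 0 && (buckets & (buckets - 1)) == 0);
    return n >> __builtin_ctz(buckets);
}

/* Assumed split rule: split once keys per bucket reaches max_load. */
static int
needs_split(unsigned keys, unsigned buckets, unsigned max_load)
{
    return div_by_pow2(keys, buckets) >= max_load;
}

/* Pre-extension: choose the final bucket count once, when the number of
 * keys is known up front, instead of growing through repeated splits.
 * The starting size of 8 is an assumption for this sketch. */
static unsigned
buckets_for(unsigned keys)
{
    unsigned buckets = 8;
    assert(keys <= (1u << 30));
    while (buckets < keys)
        buckets <<= 1;
    return buckets;
}
```

With a known key count, a single allocation of `buckets_for(n)` buckets replaces the whole series of doublings.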


wollmers · 2016-04-27T18:58:58Z

I estimate you can use linear or serial (unsorted) lookup up to 100 keys or even more, depending on benchmarks.

In my port of LCS::BV from Perl to C I began with Bob Jenkins hash and ended the tuning using VLAs (variable length arrays) on the stack, the array serially filled (\0 terminated). See llcs_seq_a() and the used hash_setpos() and hash_getpos(). With Bob Jenkins I get 250 kHz (cases per second) on i5@1500, with serial VLAs 7.5 MHz, thus factor 30x. The calloc variant llcs_seq() comes at 4 MHz.

Of course in my example I can benefit from the known restrictions: maximum size, immutable key strings, typed values (uint_64).

rurban · 2016-04-27T22:31:26Z

So many? I thought I only wanted to fill one cache line, so just very few
keys. But I'll benchmark it soon, when I have more time. Other langs tested
3-5, if I remember.

On Wed, Apr 27, 2016, 20:59 Helmut Wollmersdorfer wrote:

> I estimate you can use linear or serial (unsorted) lookup up to 100 keys
> or even more, depending on benchmarks.
>
> In my port of LCS::BV
> (https://github.com/wollmers/LCS-BV/tree/master/lib/LCS) from Perl to C I
> began with Bob Jenkins hash and ended the tuning using VLAs (variable
> length arrays) on the stack, the array serially filled (\0 terminated). See
> llcs_seq_a()
> (https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c#L105) and the
> used hash_setpos() and hash_getpos(). With Bob Jenkins I get 250 kHz
> (cases per second) on i5@1500, with serial VLAs 7.5 MHz, thus factor 30x.
> The calloc variant llcs_seq()
> (https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c#L144) comes
> at 4 MHz.
>
> Of course in my example I can benefit from the known restrictions: maximum
> size, keys strings immutable, typed values (uint_64).


wollmers · 2016-04-28T09:05:57Z

You should trust only numbers you benchmarked yourself;-)

Hash is said to have complexity O(1). But as always it is O(1*k), where k is the implementation factor.

Serial has O((n/2)*k). A break-even point of n=4 between hash and serial would need k_hash = 2 * k_serial, i.e. the hash algorithm may execute only twice as many instructions as one iteration of the serial loop. My serial has 3 instructions (C operators) in the loop, including conditions. So for a break-even at n=4 the hash function (locating the entry in the array) could use only 6 instructions.
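
Written out, the estimate above looks as follows. This takes $k_{\text{hash}}$ as the instruction count of one hash lookup and $k_{\text{serial}}$ as that of one loop iteration, and assumes a simplified cost model that ignores cache effects:

$$
C_{\text{hash}} = k_{\text{hash}}, \qquad C_{\text{serial}}(n) = \frac{n}{2}\,k_{\text{serial}}
$$

$$
C_{\text{hash}} = C_{\text{serial}}(n^{*}) \;\Longrightarrow\; n^{*} = \frac{2\,k_{\text{hash}}}{k_{\text{serial}}}
$$

Setting $n^{*} = 4$ gives $k_{\text{hash}} = 2\,k_{\text{serial}}$, and with $k_{\text{serial}} = 3$ this yields $k_{\text{hash}} = 6$. A hash function that needs more than 6 instructions moves the break-even point proportionally higher.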

I didn't optimize for cache friendliness directly. Serial just maps a nearly unbounded (sparse) alphabet to a minimal (non-sparse) one and nearly keeps the order of filling, which is memory- and cache-friendly. Hash algorithms (if not perfect hashes) map sparse to not so sparse, but still sparse.

rurban · 2016-07-17T17:05:34Z

I went with 7 because this is the initial calloced size. But it doesn't work yet, so I cannot benchmark it.

Calculate hashes on demand, but do not store them in the HEK, to make the HEK shorter and fit more entries into a cache line. HEK_HASH(hek) is now invalid and gone. Use the new HeHASH_calc(he), HEK_HASH_calc(hek), SvSHARED_HASH_calc(sv) instead. See http://www.ilikebigbits.com/blog/2016/8/28/designing-a-fast-hash-table for benchmarks (HashCache). Using 4 tests in the hot hash loop also makes little sense when checking the length and the string is enough to weed out collisions. This strategy, recomputing the hash when needed, is so far 1-7% slower, but we hope to get up to speed with the HeARRAY patch. See below.

The end goal is to get rid of linked lists and store the collisions inlined in consecutive memory, in a HekARRAY (len, cmp-flags, char*, other-flags, val). Measurements in "Cache-Conscious Collision Resolution in String Hash Tables" by Nikolas Askitis and Justin Zobel, Melbourne 2005, show that this is the fastest strategy for open hashing (chained) tables. See GH #24 and GH #102.

The next idea is to use MSB varint encoding of the string length in a HEK, because our strings are usually short: len < 63 fits into one byte. We can then merge it with the cmp-flags, the flags only needed for comparison. See https://techoverflow.net/blog/2013/01/25/efficiently-encoding-variable-length-integers-in-cc/ or just: < 63 one byte, > 63 MSB: I32 len. Note that the 1st MSB bit is already taken for UTF8.
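
One possible reading of the simpler "< 63 one byte, otherwise I32 len" variant, as a sketch. The bit layout, the constants and the `len_encode()` / `len_decode()` helpers are assumptions made for illustration, not the actual HEK format. The top bit is kept for UTF8, the low 6 bits carry a short length, and the value 63 acts as an escape for a following 32-bit length:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LEN_UTF8   0x80u  /* top bit: key is UTF-8 */
#define LEN_MASK   0x3Fu  /* low 6 bits: short length */
#define LEN_ESCAPE 0x3Fu  /* 63: a 32-bit length follows */

/* Write the length header into buf (up to 5 bytes); returns bytes written. */
static size_t
len_encode(unsigned char *buf, uint32_t len, int utf8)
{
    unsigned char flags = utf8 ? LEN_UTF8 : 0;
    assert(buf != NULL);
    if (len < LEN_ESCAPE) {
        buf[0] = (unsigned char)(flags | len);  /* flags and length share a byte */
        return 1;
    }
    buf[0] = (unsigned char)(flags | LEN_ESCAPE);
    memcpy(buf + 1, &len, sizeof len);          /* native byte order */
    return 1 + sizeof len;
}

/* Read the header back; returns the number of bytes consumed. */
static size_t
len_decode(const unsigned char *buf, uint32_t *len, int *utf8)
{
    assert(buf != NULL && len != NULL && utf8 != NULL);
    *utf8 = (buf[0] & LEN_UTF8) != 0;
    if ((buf[0] & LEN_MASK) != LEN_ESCAPE) {
        *len = buf[0] & LEN_MASK;
        return 1;
    }
    memcpy(len, buf + 1, sizeof *len);
    return 1 + sizeof *len;
}
```

In this layout bit 6 stays free for one more comparison flag, so the common short-key case can compare the length and the flags with a single byte test.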

Avoid hash calculation for a small number of keys:

- calloc the first 7 words of HvARRAY.
- If we add one to the 6th entry we need to split it, as the 7th, the last, is needed as NULL sentinel.
- On splitting a small hash, we need to allocate a fresh array to move the hashed entries to. This can be optimized further (alloc 2x).
- On inserting a new entry at the 7th slot, we can avoid a split when placeholders exist: just replace one then.

See #102. WIP: the standard operations work, but use constant fails. Currently 13% slower.
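
The insert rule described here could look roughly like this. It is a sketch with hypothetical names (`small_he`, `PLACEHOLDER`, `small_store()`), not the branch's code. It assumes the caller has already checked that the key is not present, and it leaves the split itself, and freeing a replaced placeholder entry, to the caller:

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_SLOTS 7                  /* calloc'ed words of the array */
#define SMALL_MAX   (SMALL_SLOTS - 1)  /* usable slots; the last one stays NULL */

/* Hypothetical entry type; not perl's real HE. */
typedef struct small_he { const char *key; size_t len; void *val; } small_he;

/* Hypothetical marker for deleted entries, standing in for a placeholder value. */
static char placeholder_marker;
#define PLACEHOLDER ((void *)&placeholder_marker)

/* Store he in a zero-initialized small array.
 * Returns 1 if stored, 0 if the caller has to split into a real hash. */
static int
small_store(small_he **array, small_he *he)
{
    size_t i;
    small_he **reuse = NULL;

    assert(array != NULL && he != NULL);
    for (i = 0; i < SMALL_MAX && array[i] != NULL; i++) {
        if (reuse == NULL && array[i]->val == PLACEHOLDER)
            reuse = &array[i];         /* remember the first deleted entry */
    }
    if (i < SMALL_MAX) {               /* free slot before the sentinel */
        array[i] = he;
        return 1;
    }
    if (reuse != NULL) {               /* full, but a placeholder can be replaced */
        *reuse = he;
        return 1;
    }
    return 0;                          /* all 6 slots live: split needed */
}
```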

rurban added the enhancement label Dec 30, 2015

rurban self-assigned this Dec 30, 2015
