Helper Utility for Generating Command Line Syntax
This project is a library for creating perfect hash tables from 32-bit key sets. It is based on the acyclic random 2-part hypergraph algorithm. Its primary goal is finding the fastest possible runtime solution.
It is geared toward offline table generation: a command line application is used
to generate a small C library that implements an Index()
routine, which, given
an input key, will return the order-preserved index of that key within the
original key set, e.g.:
uint32_t ix;
uint32_t key = 0x20190903;
ix = Index(key);
This allows the efficient implementation of key-value tables, e.g.:
extern uint32_t Table[];
uint32_t
Lookup(uint32_t Key)
{
return Table[Index(Key)]
};
void
Insert(uint32_t Key, uint32_t Value)
{
Table[Index(Key)] = Value;
}
void
Delete(uint32_t Key)
{
Table[Index(Key)] = 0;
}
The fastest Index()
routine is the MultiplyShiftRX
routine, which clocks in at about 6 cycles in practice, and boils down to
something like this:
extern uint32_t Assigned[];
uint32_t
Index(uint32_t Key)
{
uint32_t Vertex1;
uint32_t Vertex2;
Vertex1 = ((Key * Seed1) >> Shift);
Vertex2 = ((Key * Seed2) >> Shift);
return ((Assigned[Vertex1] + Assigned[Vertex2]) & IndexMask);
}
N.B. Seed1
, Seed2
, Shift
, and IndexMask
will all be literal
constants in the final source code, not variables.
This compiles down to:
lea r8, ptr [rip+0x23fc0]
imul edx, ecx, 0xe8d9cdf9
shr rdx, 0x10
movzx eax, word ptr [r8+rdx*2]
imul edx, ecx, 0xc2e3c0b7
shr rdx, 0x10
movzx ecx, word ptr [r8+rdx*2]
add eax, ecx
ret
The IACA profile reports 8 uops:
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - .\x64\Release\HologramWorld_31016_Chm01_MultiplyShiftRX_And.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 10.00 Cycles Throughput Bottleneck: Backend
Loop Count: 26
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
----------------------------------------------------------------------------
| Cycles | 1.3 0.0 | 2.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.3 | 1.3 | 0.0 |
----------------------------------------------------------------------------
| # Of | Ports pressure in cycles |
| Uops |0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------
| 1 | | | | | | 1.0 | | | lea r8, ptr [rip+0x23fc0]
| 1 | | 1.0 | | | | | | | imul edx, ecx, 0xe8d9cdf9
| 1 | 0.3 | | | | | | 0.7 | | shr rdx, 0x10
| 1 | | | 1.0 1.0 | | | | | | movzx eax, word ptr [r8+rdx*2]
| 1 | | 1.0 | | | | | | | imul edx, ecx, 0xc2e3c0b7
| 1 | 0.7 | | | | | | 0.3 | | shr rdx, 0x10
| 1 | | | | 1.0 1.0 | | | | | movzx ecx, word ptr [r8+rdx*2]
| 1 | 0.3 | | | | | 0.3 | 0.3 | | add eax, ecx
Total Num of Uops: 8
The "cost" behind the perfect hash table is the Assigned
array. The size of
this array will be the number of keys, rounded up to a power of two, and then
doubled. E.g. HologramWorld-31016.keys
has 31,016 keys. Rounded up to a
power of two is 32,768, then doubled: 65,336.
The data type used by the Assigned
array is the smallest C data type that
can hold the number of keys rounded up to a power of two. Thus, a 16-bit
unsigned short int
can be used for the HologramWorld-31016.keys
array:
unsigned short int Assigned[65336] = { ... };
Thus, sizeof(Assigned)
will be 131,072, or 128KB.
The Index()
routine will perform two memory lookups into this array per call.
No pointer chasing or indirection is required. The most frequent keys will have
both locations in L1 cache; the worst-case scenario is two memory lookups for
both locations for cold or infrequent keys.
Initially, all development on this project was done exclusively on Windows, with Visual Studio 2022. Linux support has recently been added, although it is still quite rough around the edges. For some context: about 1,700 hours have been spent on the Windows portion. The Linux support was hacked together in about two weeks of evening and weekend work.
The generated compiled perfect hash tables are cross-platform, and will work on Windows, Mac, Linux, x86, x64, and ARM64.
mkdir c:\src
cd src
git clone https://github.com/tpn/perfecthash
git clone https://github.com/tpn/perfecthash-keys
cd perfecthash/src
The PerfectHash.sln
file lives in perfecthash/src
. You can either build
this directly via Visual Studio, use one of the build-*.bat
files, or just
use msbuild
from a Visual Studio 2022 command prompt:
msbuild /nologo /m /t:Rebuild /p:Configuration=Release;Platform=x64
You can also download the latest binaries from the Releases
page. The PGO
zip files refer to profile-guided optimization builds, and are
generally faster than the Release
builds by up to 30-40%.
Once built or downloaded, there are two main command line executables:
PerfectHashCreate.exe
, and PerfectHashBulkCreate.exe
. The former is for
creating a single table, and it takes a single input key file. The latter
can be pointed at a directory of keys, and it will create tables for all of
them.
Prerequisites: C compiler (GCC 10 tested), CMake. Optional: Ninja.
mkdir -p ~/src && cd ~/src
git clone https://github.com/tpn/perfecthash
git clone https://github.com/tpn/perfecthash-keys
cd perfecthash/src
mkdir build && cd build
cmake -S .. -B . -G"Ninja Multi-Config"
ninja -F build-Release.ninja
ninja -F build-Debug.ninja
For normal Makefile support:
cd perfecthash/src
mkdir build && cd build
cmake -S .. -B . -G"Unix Makefiles"
make
I haven't tried to build it on Mac yet. It'll require a bit of hacking, I assume.
The usage options are almost identical for both programs. If you run either one without arguments, it will print detailed usage instructions, also available here.
The main usage follows:
PerfectHashBulkCreate.exe Usage:
<KeysDirectory> <OutputDirectory>
<Algorithm> <HashFunction> <MaskFunction>
<MaximumConcurrency>
[BulkCreateFlags] [KeysLoadFlags] [TableCreateFlags]
[TableCompileFlags] [TableCreateParameters]
PerfectHashCreate.exe Usage:
<KeysPath> <OutputDirectory>
<Algorithm> <HashFunction> <MaskFunction>
<MaximumConcurrency>
[CreateFlags] [KeysLoadFlags] [TableCreateFlags]
[TableCompileFlags] [TableCreateParameters]
Assuming you have built a Release
version of the library, from a Visual Studio
2022 command prompt (i.e. so the compiler is available in your PATH
):
mkdir c:\Temp\ph.out
cd c:\src\perfecthash\src
..\bin\timemem.exe x64\Release\PerfectHashCreate.exe c:\src\perfecthash\keys\HologramWorld-31016.keys c:\Temp\ph.out Chm01 MultiplyShiftR And 0 --Compile
On Linux this would look like:
mkdir -p ~/tmp/ph.out
cd ~/src/perfecthash/src
time ../x64/Release/PerfectHashCreateExe $HOME/src/perfecthash-keys/sys32/HologramWorld-31016.keys ~/tmp/ph.out Chm01 MultiplyShiftR And 0 --DisableCsvOutputFile
This should result in some output that looks like this:
c:\src\perfecthash\src>..\bin\timemem x64\Release\PerfectHashCreate.exe c:\src\perfecthash\keys\HologramWorld-31016.keys c:\Temp\ph.out Chm01 MultiplyShiftR And 0 --Compile
Keys File Name: HologramWorld-31016.keys
Number of Keys: 31016
Number of Table Resize Events: 0
Keys to Edges Ratio: 0.946533203125
Duration: 0 hours, 0 mins, 0 secs
Duration Since Last Best Graph:
Attempts: 8633
Attempts Per Second: 53956.250000000
Current Attempts: 8633
Current Attempts Per Second: 53956.250000000
Successful Attempts: 1
Failed Attempts: 8625
First Attempt Solved: 0
Most Recent Attempt Solved: 8562
Predicted Attempts to Solve: 8634
Predicted Attempts Remaining until next Solve: 8563
Estimated Seconds until next Solve: 0.15870265261206998
New Best Graph Count: 0
Equal Best Graph Count: 0
Solutions Found Ratio: 0.00011583458820803892
Vertex Collision Failures: 4294
Cyclic Graph Failures: 4331
Vertex Collision to Cyclic Graph Failure Ratio: 0.99145693835142
Highest Deleted Edges Count: 31008
[r] Refresh [f] Finish [e] Resize [c] Toggle Callback [?] More Help
.
Exit code : 0
Elapsed time : 2.89
Kernel time : 0.00 (0.0%)
User time : 2.97 (102.7%)
page fault # : 6504
Working set : 24164 KB
Paged pool : 167 KB
Non-paged pool : 52 KB
Page file size : 32280 KB
N.B. Console output isn't supported on Linux yet.
N.B. If you get an error like this, it means msbuild
couldn't be found on
your path; make sure to launch a Visual Studio 2022 command prompt:
C:\src\perfecthash\src\PerfectHash\PerfectHashTableCompile.c: 217: CreateProcessW failed with error: 2 (0x2). The system cannot find the file specified.
C:\src\perfecthash\src\PerfectHash\PerfectHashContextTableCreate.c: 492: PerfectHashTableCompile failed with error: 3758359076 (0xe0040224). System call failed.
If you look in C:\Temp\ph.out
, you'll see something along the following lines.
Intermediate build files have been omitted for brevity.
C:\Temp\ph.out>tree /f
Folder PATH listing for volume Windows
Volume serial number is 2490-4F03
C:.
│ CompiledPerfectHash.h
│ CompiledPerfectHash.props
│ CompiledPerfectHashMacroGlue.h
│ no_sal2.h
│ PerfectHashTableCreate_10A0ED40.csv
│
├───HologramWorld_31016_Chm01_MultiplyShiftR_And
│ HologramWorld_31016_Chm01_MultiplyShiftR_And.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And.def
│ HologramWorld_31016_Chm01_MultiplyShiftR_And.h
│ HologramWorld_31016_Chm01_MultiplyShiftR_And.pht1
│ HologramWorld_31016_Chm01_MultiplyShiftR_And.sln
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFull.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFull.mk
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFullExe.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFullExe.vcxproj
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndex.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndex.mk
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndexExe.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndexExe.vcxproj
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Build.bat
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Dll.vcxproj
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Keys.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Lib.mk
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_So.mk
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_StdAfx.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_StdAfx.h
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Support.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Support.h
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_TableData.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_TableValues.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Test.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Test.mk
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_TestExe.c
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_TestExe.vcxproj
│ HologramWorld_31016_Chm01_MultiplyShiftR_And_Types.h
│ main.mk
│ Makefile
│
└───x64
└───Release
BenchmarkFull_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
BenchmarkFull_HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb
BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb
HologramWorld_31016_Chm01_MultiplyShiftR_And.dll
HologramWorld_31016_Chm01_MultiplyShiftR_And.lib
HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb
Test_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
Test_HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb
The main library implementing the perfect hash table is the .dll
file. There
are three helper .exe
utilities also compiled for benchmarking and testing.
You can assess the performance of just the Index()
routine via:
BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
:
cd /d c:\Temp\ph.out\x64\Release
c:\src\perfecthash\bin\timemem.exe BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
On Windows, the output will look like this:
C:\Temp\ph.out\x64\Release>c:\src\perfecthash\bin\timemem.exe BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
Exit code : 7106
Elapsed time : 0.01
Kernel time : 0.00 (0.0%)
User time : 0.00 (0.0%)
page fault # : 849
Working set : 3220 KB
Paged pool : 22 KB
Non-paged pool : 5 KB
Page file size : 520 KB
The exit code, 7106 in this case, is the minimum number of cycles it took, out
of 1000 attempts, to do 1000 calls to the Index()
routine. So, you can
divide by 1000 to get the approximate number of cycles per call, in this case,
about 7.
The BenchmarkFull
executable returns the minimum number of cycles it took, out
of 100 attempts, to do 10 iterations of the following:
- For each key, call
Insert(Key, RotateLeft(Key, 15))
. - For each key, call
Value = Lookup(Key)
. - For each key, call
Previous = Delete(Key)
.
C:\Temp\ph.out\x64\Release>c:\src\perfecthash\bin\timemem.exe BenchmarkFull_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe
Exit code : 7321346
Elapsed time : 0.28
Kernel time : 0.00 (0.0%)
User time : 0.16 (55.7%)
page fault # : 1036
Working set : 3832 KB
Paged pool : 31 KB
Non-paged pool : 5 KB
Page file size : 564 KB
The .dll
files are compiled with special IACA
versions of each Index()
routine (i.e. IndexIaca()
), so you can call iaca.exe
on them to get an
analysis of the generated code, e.g.:
C:\Temp\ph.out\x64\Release>c:\src\perfecthash\bin\iaca.exe HologramWorld_31016_Chm01_MultiplyShiftR_And.dll
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - HologramWorld_31016_Chm01_MultiplyShiftR_And.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 8.93 Cycles Throughput Bottleneck: Backend
Loop Count: 30
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | 1.0 | | | | | | | imul ecx, ecx, 0xff8d672d
| 1 | 1.0 | | | | | | | | shr rax, 0x9
| 1* | | | | | | | | | movzx edx, ax
| 1 | | | | | | | 1.0 | | shr rcx, 0xd
| 1 | | | 1.0 1.0 | | | | | | movzx eax, word ptr [r8+rdx*2]
| 1* | | | | | | | | | movzx edx, cx
| 1 | | | | 1.0 1.0 | | | | | movzx ecx, word ptr [r8+rdx*2]
| 1 | | | | | | 1.0 | | | add eax, ecx
Total Num Of Uops: 8
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.
Although note that this isn't an exact science, sometimes the compiler
reorders the IACA_VC_START()
and IACA_VC_END()
markers such that you end up
missing a couple of instructions in the analysis. In the example above, the
actual assembly for the Index()
routine is as follows:
C:\Temp\ph.out\x64\Release>dumpbin /disasm HologramWorld_31016_Chm01_MultiplyShiftR_And.dll
Microsoft (R) COFF/PE Dumper Version 14.34.31935.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HologramWorld_31016_Chm01_MultiplyShiftR_And.dll
File Type: DLL
CompiledPerfectHash_HologramWorld_31016_Chm01_MultiplyShiftR_And_Index:
0000000180001000: 69 C1 DF 0E AD FF imul eax,ecx,0FFAD0EDFh
0000000180001006: 4C 8D 05 F3 3F 02 lea r8,[HologramWorld_31016_Chm01_MultiplyShiftR_And_TableData]
00
000000018000100D: 69 C9 2D 67 8D FF imul ecx,ecx,0FF8D672Dh
0000000180001013: 48 C1 E8 09 shr rax,9
0000000180001017: 0F B7 D0 movzx edx,ax
000000018000101A: 48 C1 E9 0D shr rcx,0Dh
000000018000101E: 41 0F B7 04 50 movzx eax,word ptr [r8+rdx*2]
0000000180001023: 0F B7 D1 movzx edx,cx
0000000180001026: 41 0F B7 0C 50 movzx ecx,word ptr [r8+rdx*2]
000000018000102B: 03 C1 add eax,ecx
000000018000102D: 25 FF 7F 00 00 and eax,7FFFh
0000000180001032: C3 ret
To compile the hash table on Linux (using WSL1 and GCC 9 as an example):
% cd /mnt/c/Temp/ph.out/HologramWorld_31016_Chm01_MultiplyShiftR_And
% make
% export LD_LIBRARY_PATH=.
% ./BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And
8094
With clang (version 10):
% make clean
% CC=clang make
% ./BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And
7068
Identical to Linux, except you don't need export LD_LIBRARY_PATH=.
:
% make
% ./BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And