+ **TurboPFor**
- 100% C (C++-compatible headers), no inline assembly
- No other integer-compression library compresses or decompresses faster with better compression
- Direct access is several times faster than in other libraries
- Usage in C/C++ as easy as memcpy
- Integrated (SIMD) differential/zigzag encoding/decoding for sorted/unsorted integer lists
- Full-range 16/32/64-bit integer lists and floating point
- Java Critical Native Interface: access TurboPFor, incl. SIMD, from Java as fast as calling from C
- Compresses better and faster than specialized binary compressors like blosc
+ **Features**
- Scalar **"Variable Byte"**: faster and more efficient than any other implementation
- **Novel** **"Variable Simple"** (incl. RLE): faster and more efficient than simple16 or simple-8b
- Scalar **"Bit Packing"** decoding as fast as SIMD packing in realistic (not pure-cache) scenarios
- Bit packing with **direct/random access**, without decompressing entire blocks
- Access any single bit-packed entry with **zero decompression**
- **Direct update** of individual bit-packed entries
- Reduced **cache pollution**
- Fastest and most efficient **"SIMD Bit Packing"**
- Fastest **"Elias Fano"** implementation, with or without SIMD
- **Novel** **"TurboPFor"** (Patched Frame-of-Reference, PFor/PForDelta) scheme with **direct access** or bulk decoding. Outstanding compression and speed, more efficient than **any** other integer-compression scheme.
- Scalar & SIMD transforms: delta, zigzag, transpose/shuffle, floating point <-> integer
+ **Inverted Index ...do less, go fast!**
- Direct access to compressed *frequency* and *position* data in the inverted index, with zero decompression
- **Novel** **"intersection w/ skip intervals"**: decompress only the minimum necessary blocks (~10-15%)
- **Novel** implicit skips with zero extra overhead
- **Novel** efficient **bidirectional** inverted index architecture (forward/backward traversal)
- More than **2000 queries per second** on the GOV2 dataset (25 million documents) on a **single** core
- Parallel query processing on multicore CPUs: more than **7000 queries/sec** on a quad-core PC
**...forget** ~~MapReduce, Hadoop, multi-node clusters~~ ...
CPU: Sandy Bridge i7-2600K at 4.2GHz, gcc 5.1, Ubuntu 15.04, single thread.
- Realistic and practical benchmark with large integer arrays
- Not a pure-cache benchmark
-
Generate and test skewed distribution (100.000.000 integers, Block size=128).
./icbench -a1.5 -m0 -M255 -n100m
Size | Ratio % | Bits/Integer | C Time MI/s | D Time MI/s | Function |
---|---|---|---|---|---|
63.392.801 | 15.85 | 5.07 | 388.36 | 1400.87 | TurboPFor |
63.392.801 | 15.85 | 5.07 | 365.26 | 246.93 | TurboPForDA |
65.359.916 | 16.34 | 5.23 | 7.09 | 638.96 | OptPFD |
72.364.024 | 18.09 | 5.79 | 85.31 | 762.00 | Simple16 |
78.514.276 | 19.63 | 6.28 | 251.34 | 841.61 | VSimple |
95.915.096 | 23.98 | 7.67 | 221.46 | 1049.70 | Simple-8b |
99.910.930 | 24.98 | 7.99 | 2603.47 | 1948.65 | TurboPackV |
99.910.930 | 24.98 | 7.99 | 2524.50 | 1943.41 | [SIMDPack FPF](#SIMDPack FPF) |
99.910.930 | 24.98 | 7.99 | 1883.21 | 1898.11 | TurboPack |
99.910.930 | 24.98 | 7.99 | 1877.25 | 935.83 | TurboForDA |
102.074.663 | 25.52 | 8.17 | 1621.64 | 1694.64 | TurboVbyte |
102.074.663 | 25.52 | 8.17 | 1214.12 | 1688.95 | MaskedVByte |
102.074.663 | 25.52 | 8.17 | 1178.72 | 949.59 | [Vbyte FPF](#Vbyte FPF) |
103.035.930 | 25.76 | 8.24 | 1480.47 | 1746.51 | libfor |
112.500.000 | 28.12 | 9.00 | 305.85 | 1899.15 | VarintG8IU |
400.000.000 | 100.00 | 32.00 | 1451.11 | 1493.46 | Copy |
N/A | N/A | N/A | N/A | N/A | EliasFano |
MI/s: 1.000.000 integers/second. 1000 MI/s = 4 GB/s.
**BOLD** = Pareto frontier. FPF = FastPFor.
TurboPForDA, TurboForDA: direct access is normally used when accessing individual values.
-
gov2.sorted from [DocId data set](#DocId data set), block size = 128 (lz4, blosc and VSimple tested w/ 64Ki blocks)
./icbench -c1 gov2.sorted
Size | Ratio % | Bits/Integer | C Time MI/s | D Time MI/s | Function |
---|---|---|---|---|---|
3.214.763.689 | 13.44 | 4.30 | 339.90 | 837.69 | VSimple 64Ki |
3.337.758.854 | 13.95 | 4.47 | 5.06 | 513.00 | OptPFD |
3.357.673.495 | 14.04 | 4.49 | 357.77 | 1192.14 | TurboPFor |
3.501.671.314 | 14.64 | 4.68 | 321.45 | 827.01 | VSimple |
3.766.174.764 | 15.75 | 5.04 | 617.88 | 712.31 | EliasFano |
3.820.190.182 | 15.97 | 5.11 | 118.81 | 650.21 | Simple16 |
3.958.888.197 | 16.55 | 5.30 | 279.19 | 618.60 | lz4+DT 64Ki |
4.521.326.518 | 18.90 | 6.05 | 209.17 | 824.26 | Simple-8b |
4.683.323.301 | 19.58 | 6.27 | 828.97 | 1007.44 | TurboVbyte |
4.953.768.342 | 20.71 | 6.63 | 1766.05 | 1943.87 | TurboPackV |
4.953.768.342 | 20.71 | 6.63 | 1419.35 | 1512.86 | TurboPack |
5.203.353.057 | 21.75 | 6.96 | 1560.34 | 1806.60 | SIMDPackD1 FPF |
6.074.995.117 | 25.40 | 8.13 | 494.70 | 729.97 | blosc_lz4 64Ki |
6.221.886.390 | 26.01 | 8.32 | 1666.76 | 1737.72 | TurboFor |
6.221.886.390 | 26.01 | 8.32 | 1660.52 | 565.25 | TurboForDA |
6.699.519.000 | 28.01 | 8.96 | 472.01 | 495.12 | Vbyte FPF |
6.700.989.563 | 28.02 | 8.96 | 728.72 | 991.57 | MaskedVByte |
7.622.896.878 | 31.87 | 10.20 | 208.73 | 1197.74 | VarintG8IU |
8.594.342.216 | 35.93 | 11.50 | 1307.22 | 1593.07 | libfor |
8.773.150.644 | 36.68 | 11.74 | 637.83 | 1301.05 | blosc_lz 64Ki |
23.918.861.764 | 100.00 | 32.00 | 1456.17 | 1480.78 | Copy |
Ki = 1024 integers. 64Ki integers = 256KB
"lz4+DT 64Ki" = delta+transpose from TurboPFor + lz4
"blosc_lz4" tested w/ the lz4 compressor + vectorized shuffle
GOV2: 426GB, 25 million documents, average doc. size = 18KB.
-
AOL query log: 18.000 queries
~1300 queries per second (single core)
~5000 queries per second (quad core)
Ratio = 14.37% decoded/total integers.
-
TREC Million Query Track (1MQT):
~1100 queries per second (single core)
~4500 queries per second (quad core)
Ratio = 11.59% decoded/total integers.
- Benchmarking intersections (Single core, AOL query log)
max.docid/q | Time s | q/s | ms/q | % docid found |
---|---|---|---|---|
1.000 | 7.88 | 2283.1 | 0.438 | 81 |
10.000 | 10.54 | 1708.5 | 0.585 | 84 |
ALL | 13.96 | 1289.0 | 0.776 | 100 |
q/s: queries/second, ms/q: milliseconds/query
- Benchmarking Parallel Query Processing (Quad core, AOL query log)
max.docid/q | Time s | q/s | ms/q | % docids found |
---|---|---|---|---|
1.000 | 2.66 | 6772.6 | 0.148 | 81 |
10.000 | 3.39 | 5307.5 | 0.188 | 84 |
ALL | 3.57 | 5036.5 | 0.199 | 100 |
- Search engines spend ~90% of query-processing time in intersections.
- Most search engines use pruning strategies, caching of popular queries, etc. to reduce intersection and query-processing time.
- As an indication, Google processes ~40.000 queries per second, using ~900.000 multicore servers to search 8 billion web pages (320x the size of GOV2).
- Recent GOV2 experiments ("On Inverted Index Compression for Search Engine Efficiency", best paper at ECIR 2014) on an 8-core Xeon PC report 1.2 seconds per query (for the top 1.000 docids).
make
-
test all functions
./icbench -a1.0 -m0 -M255 -n100m
- zipfian distribution, alpha = 1.0 (ex. -a1.0 = uniform, -a1.5 = skewed distribution)
- number of integers = 100.000.000
- integer range from 0 to 255
-
individual function test (ex. Copy, TurboPack, TurboPFor)
./icbench -a1.5 -m0 -M255 -ecopy/turbopack/turbopfor -n100m
-
Data file Benchmark (file from [DocId data set](#DocId data set))
./icbench -c1 gov2.sorted
1 - Download Gov2 (or ClueWeb09) + query files (ex. "1mq.txt") from [DocId data set](#DocId data set)
8GB RAM required (16GB recommended for benchmarking the "clueweb09" files).
2 - Create index file
./idxcr gov2.sorted .
creates the inverted index file "gov2.sorted.i" in the current directory
3 - Test intersections
./idxqry gov2.sorted.i 1mq.txt
runs the queries in file "1mq.txt" over the index of the gov2 file
1 - Create partitions
./idxseg gov2.sorted . -26m -s8
creates 8 partitions (one per CPU hardware thread) for a total of ~26 million document ids
2 - Create index file for each partition
./idxcr gov2.sorted.s*
creates inverted index files for all partitions ("gov2.sorted.s00" - "gov2.sorted.s07") in the current directory
3 - Intersections:
delete the "idxqry.o" file, then type "make para" to compile "idxqry" with multithreading
./idxqry gov2.sorted.s*.i 1mq.txt
runs the queries in file "1mq.txt" over the indexes of all gov2 partitions ("gov2.sorted.s00.i" - "gov2.sorted.s07.i").
See the benchmark program "icbench" for usage examples. In general, encoding/decoding functions have the form:
char *endptr = encode( unsigned *in, unsigned n, char *out, [unsigned start], [int b])
endptr : set by encode to the next character in "out" after the encoded buffer
in : input integer array
n : number of elements
out : pointer to output buffer
b : number of bits. Only for bit packing functions
start : previous value. Only for integrated delta encoding functions
char *endptr = decode( char *in, unsigned n, unsigned *out, [unsigned start], [int b])
endptr : set by decode to the next character in "in" after the decoded buffer
in : pointer to input buffer
n : number of elements
out : output integer array
b : number of bits. Only for bit unpacking functions
start : previous value. Only for integrated delta decoding functions
Header files to include, with the functions they declare:
header file | Functions |
---|---|
vint.h | variable byte |
vsimple.h | variable simple |
vp4dc.h, vp4dd.h | TurboPFor |
bitpack.h, bitunpack.h | Bit Packing, For, +Direct Access |
eliasfano.h | Elias Fano |
- Linux: GNU GCC (>=4.6)
- clang (>=3.2)
- Windows: MinGW-w64 (no parallel query processing)
- All TurboPFor functions are thread safe
- FastPFor + simdcomp: SIMDPack FPF, Vbyte FPF, VarintG8IU
- Optimized PFor-delta compression code: OptPFD/OptP4, Simple16 (limited to 28-bit integers)
- MaskedVByte. See also: Vectorized VByte Decoding
- Index Compression Using 64-Bit Words: Simple-8b (speed-optimized version tested)
- libfor
- lz4: included w/ block size 64K as an indication; tested after preprocessing w/ delta+transpose
- blosc: roughly transpose/shuffle + lz77; tested blosc+lz4 and blosclz, incl. vectorized shuffle. See also the blosc author's single- and multithreaded benchmarks
- Document identifier data set
- Publications:
Last update: 8 JUL 2015