ORC-1356: [C++] Support RLEv2 bit-unpacking to leverage Intel AVX-512 instructions

### What changes were proposed in this pull request?

The original ORC RLE bit-packing decoder decodes values one at a time. Intel AVX-512 provides 512-bit vector operations that can accelerate this decode process: far fewer CPU instructions are needed to unpack more data, so AVX-512 vector decode performs much better than the original path. The functional micro-performance test suggests that AVX-512 vector decode can bring an average 6x to 7x latency improvement when the vector functions vectorUnpackX are compared with the original RLE bit-packing decode function plainUnpackLongs.

In real-world use, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode is more efficient and helps this processing. Values in ORC data are also usually less than 32 bits wide, so this PR provides vector code for value widths below 32 bits. For widths that are a multiple of 8 bits (8-bit, 16-bit, and so on), the performance improvement is relatively small compared with widths that are not a multiple of 8.

Intel AVX512 instructions official link: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

1. Added a cmake option named "BUILD_ENABLE_AVX512" to enable or disable this feature at build time. The default value of BUILD_ENABLE_AVX512 is OFF. For example, the following builds the ORC library with AVX512 bit-unpacking enabled:

        cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON

2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether the code for this feature is compiled into ORC.
3. Added the file "CpuInfoUtil.cc" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but runs on a platform that does not support AVX-512, it uses the original bit-packing decode function instead of AVX-512 vector decode.
4. Added the functions "vectorUnpackX" to decode X-bit values in place of the original function plainUnpackLongs.
5. Added the testcases "RleV2BitUnpackAvx512Test" in the new testcase file TestRleVectorDecoder.cc to verify AVX-512 vector decode of N-bit values.
6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit. This parameter stores the number of bits left over after unpacking.
7. AVX-512 vector decode processes 512 bits of data in each unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is too short, less than 512 bits, AVX-512 is not used and decoding falls back to the original one-by-one unpacking.

### Why are the changes needed?

This can improve the performance of RLE bit-packing decode. The functional micro-performance test suggests that AVX-512 vector decode can bring an average 6x to 7x latency improvement when the vector functions vectorUnpackX are compared with the original RLE bit-packing decode function plainUnpackLongs.

Intel improves CPU performance every year, and users run data analysis on the ORC format on newer platforms. The Intel SKX platform already supported AVX512 instructions six years ago. ORC data unpacking should therefore be updated to use this widely available CPU feature, so that ORC keeps pace with current hardware.

### How to enable AVX512 Bit-unpacking?

1. Enable the cmake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled:

        cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON

2. Set the environment variable when using the ORC library:

        export ORC_USER_SIMD_LEVEL=AVX512

   (Note: this parameter accepts only two values, "AVX512" and "none", and the value is not case-sensitive.) If ORC_USER_SIMD_LEVEL=none is set, AVX512 bit-unpacking is disabled.

### How was this patch tested?

I created a new testcase file, TestRleVectorDecoder.cc. It contains the testcases described below. To run them, turn on the cmake option -DBUILD_ENABLE_AVX512=ON and execute the tests on a platform that supports AVX-512. Every testcase contains 2 scenarios:

1. The blockSize increases from 1 to 10000, and the data length is 10240.
2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.

Each testcase runs for a while, so I added a progress bar to every testcase. Here is the progress bar output of one testcase:

    [ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
    10bit Test 1st Part:[OK][#################################################################################][100%]
    10bit Test 2nd Part:[OK][#################################################################################][100%]

For the main vector functions vectorUnpackX, test code coverage reaches 100%.

This closes apache#1375
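For readers who want a concrete picture of the fallback rule in item 7 and of the ORC_USER_SIMD_LEVEL switch, here is a minimal sketch. It is not the code from this PR: `parseSimdLevel`, `chooseDecoder`, and `unpackScalar` are hypothetical names introduced for illustration, the `cpuHasAvx512` flag stands in for whatever CpuInfoUtil.cc reports, and the scalar loop is a simple stand-in for plainUnpackLongs that assumes values are packed most-significant bit first. Treating any value other than "AVX512" as "none" is also an assumption of the sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class SimdLevel { None, Avx512 };
enum class Decoder { Scalar, Vector };

// Case-insensitive parse of an ORC_USER_SIMD_LEVEL value.
// Assumption: anything other than "AVX512" is treated as "none".
SimdLevel parseSimdLevel(std::string value) {
  std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  return value == "AVX512" ? SimdLevel::Avx512 : SimdLevel::None;
}

// Hypothetical dispatch: take the vector path only when it was requested,
// the CPU supports it, and at least 512 bits of packed data are available.
Decoder chooseDecoder(SimdLevel level, bool cpuHasAvx512,
                      std::size_t numValues, unsigned bitWidth) {
  const std::size_t totalBits = numValues * bitWidth;
  if (level == SimdLevel::Avx512 && cpuHasAvx512 && totalBits >= 512) {
    return Decoder::Vector;
  }
  return Decoder::Scalar;
}

// Scalar reference unpacker: reads `count` values of `bitWidth` bits each,
// one bit at a time, most-significant bit first.
std::vector<std::uint64_t> unpackScalar(const std::vector<std::uint8_t>& in,
                                        std::size_t count, unsigned bitWidth) {
  std::vector<std::uint64_t> out;
  out.reserve(count);
  std::size_t bitPos = 0;  // absolute bit offset into `in`
  for (std::size_t i = 0; i < count; ++i) {
    std::uint64_t value = 0;
    for (unsigned b = 0; b < bitWidth; ++b, ++bitPos) {
      const std::uint8_t byte = in[bitPos / 8];
      const unsigned bit = (byte >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}
```

With these definitions, a run of 100 values at 10 bits each (1000 bits) would select the vector path on a capable CPU, while a run of 10 such values (100 bits) would fall back to the scalar loop. A real implementation would also need to hand the leftover bit offset from one decoder to the other, which is the role the description gives to the startBit output parameter.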