ORC-1356: [C++] Support RLEv2 bit-unpacking to leverage Intel AVX-512 instructions

### What changes were proposed in this pull request?

The original ORC RLE bit-packing code decodes values one at a time. Intel AVX-512 provides 512-bit vector operations that can accelerate this decode: far fewer CPU instructions are needed to unpack the same amount of data, so the AVX-512 vector decode performs much better than the original.

A functional micro-performance test comparing the vector function vectorUnpackX with the original RLE bit-packing decode function plainUnpackLongs suggests that AVX-512 vector decode can bring an average 6x to 7x latency improvement.

In the real world, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode is more efficient and helps speed up this processing.

Values stored in ORC are usually less than 32 bits wide, so this PR supplies vector code for value widths below 32 bits. For 8-bit, 16-bit, and other widths that are a multiple of 8, the performance improvement is relatively small compared with widths that are not a multiple of 8.

Intel AVX512 instructions official link: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

1. Added a cmake option named "BUILD_ENABLE_AVX512" to enable or disable this feature at build time. The default value of BUILD_ENABLE_AVX512 is OFF. For example, the following command builds the ORC library with AVX512 bit-unpacking enabled:
   cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether this feature's code is built into ORC.
3. Added the file "CpuInfoUtil.cc" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but the platform it runs on doesn't support AVX-512, it uses the original bit-packing decode function instead of the AVX-512 vector decode.
4. Added the functions "vectorUnpackX" to decode X-bit values in place of the original function plainUnpackLongs.
5. Added the testcases "RleV2BitUnpackAvx512Test" to verify N-bit value AVX-512 vector decode in the new testcase file TestRleVectorDecoder.cc.
6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit. This parameter stores the number of bits left over after unpacking.
7. AVX-512 vector decode processes 512 bits of data in every unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is less than 512 bits, AVX-512 is not used and decoding falls back to the original path, which unpacks values one by one.

### Why are the changes needed?

This can improve the performance of RLE bit-packing decode. A functional micro-performance test comparing the vector function vectorUnpackX with the original decode function plainUnpackLongs suggests an average 6x to 7x latency improvement.

Intel improves CPU performance every year, and users run data analysis on ORC data on newer platforms. Intel's SKX platform already supported AVX512 instructions six years ago. ORC data unpacking should therefore be upgraded to use this widely available CPU feature, which keeps ORC in step with current hardware.

### How to enable AVX512 Bit-unpacking?

1. Enable the cmake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled:
   cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Set the environment variable when using the ORC library:
   export ORC_USER_SIMD_LEVEL=AVX512
   (Note: This variable accepts only two values, "AVX512" and "none", and the value is case-insensitive.)
   If ORC_USER_SIMD_LEVEL=none is set, AVX512 bit-unpacking is disabled.

### How was this patch tested?

I created a new testcase file, TestRleVectorDecoder.cc. It contains the testcases described below. To run them, turn on the cmake option -DBUILD_ENABLE_AVX512=ON and execute them on a platform that supports AVX-512. Every testcase contains 2 scenarios:

1. The blockSize increases from 1 to 10000, and the data length is 10240;
2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.

The testcases take a while to execute, so I added a progress bar for every testcase. Here is the progress bar output of one testcase:

    [ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
    10bit Test 1st Part:[OK][#################################################################################][100%]
    10bit Test 2nd Part:[OK][#################################################################################][100%]

Test code coverage of the main vector function vectorUnpackX reaches 100%.

This closes apache#1375
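
The fallback rule in the commit message (use the vector path only when at least 512 bits of packed data are available, otherwise decode one value at a time) can be illustrated with a small scalar sketch. The names `unpackScalar` and `longEnoughForVectorPath` are hypothetical and are not the functions in this commit; the sketch also assumes that values are packed MSB-first within each byte, and it contains no AVX-512 intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not ORC's plainUnpackLongs): decode `count` values of
// `bitWidth` bits each, one bit at a time, reading each byte MSB-first.
std::vector<int64_t> unpackScalar(const std::vector<uint8_t>& packed, uint32_t bitWidth,
                                  std::size_t count) {
  assert(bitWidth >= 1 && bitWidth <= 64);
  std::vector<int64_t> out;
  out.reserve(count);
  uint64_t bitPos = 0;  // absolute bit index into `packed`
  for (std::size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      const uint8_t byte = packed.at(bitPos / 8);  // throws if the input is too short
      const uint64_t bit = (byte >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(static_cast<int64_t>(value));
  }
  return out;
}

// Hypothetical predicate: the vector path is only worth taking when at least
// one full 512-bit block of packed data is available.
bool longEnoughForVectorPath(std::size_t count, uint32_t bitWidth) {
  return static_cast<uint64_t>(count) * bitWidth >= 512;
}
```

For instance, the three bytes 0x29 0xCB 0xB8 decoded at 3 bits per value give 1, 2, 3, 4, 5, 6, 7, 0, and eight such values (24 bits) fall well below the 512-bit threshold, so they would stay on the scalar path.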
cxzl25 · Jan 11, 2024 · fea966e
1 parent 02dbe7c
commit fea966e

Showing 18 changed files with 5,291 additions and 352 deletions.
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
@@ -79,8 +79,16 @@ jobs:
         cat /home/runner/work/orc/orc/build/java/rat.txt
 
   windows:
-    name: "Build on Windows"
+    name: "C++ ${{ matrix.simd }} Test on Windows"
     runs-on: windows-2019
+    strategy:
+      fail-fast: false
+      matrix:
+        simd:
+          - General
+          - AVX512
+    env:
+      ORC_USER_SIMD_LEVEL: AVX512
     steps:
     - name: Checkout
       uses: actions/checkout@v2
@@ -89,13 +97,41 @@ jobs:
       with:
         msbuild-architecture: x64
     - name: "Test"
+      shell: bash
       run: |
         mkdir build
         cd build
-        cmake .. -G "Visual Studio 16 2019" -DCMAKE_BUILD_TYPE=Debug -DBUILD_LIBHDFSPP=OFF -DBUILD_TOOLS=OFF -DBUILD_JAVA=OFF
+        if [ "${{ matrix.simd }}" = "General" ]; then
+          cmake .. -G "Visual Studio 16 2019" -DCMAKE_BUILD_TYPE=Debug -DBUILD_LIBHDFSPP=OFF -DBUILD_TOOLS=OFF -DBUILD_JAVA=OFF
+        else
+          cmake .. -G "Visual Studio 16 2019" -DCMAKE_BUILD_TYPE=Debug -DBUILD_LIBHDFSPP=OFF -DBUILD_TOOLS=OFF -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON
+        fi
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: AVX512
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: "Test"
+      run: |
+        mkdir -p ~/.m2
+        mkdir build
+        cd build
+        cmake -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON ..
+        make package test-out
+
   doc:
     name: "Javadoc generation"
     runs-on: ubuntu-20.04

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -72,6 +72,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(BUILD_ENABLE_AVX512
+    "Enable build with AVX512 at compile time"
+    OFF)
+
 # Make sure that a build type is selected
 if (NOT CMAKE_BUILD_TYPE)
   message(STATUS "No build type selected, default to ReleaseWithDebugInfo")
@@ -121,7 +125,7 @@ if (CMAKE_CXX_COMPILER_ID MATCHES "Clang")
   set (WARN_FLAGS "${WARN_FLAGS} -Wno-covered-switch-default")
   set (WARN_FLAGS "${WARN_FLAGS} -Wno-missing-noreturn -Wno-unknown-pragmas")
   set (WARN_FLAGS "${WARN_FLAGS} -Wno-gnu-zero-variadic-macro-arguments")
-  set (WARN_FLAGS "${WARN_FLAGS} -Wconversion")
+  set (WARN_FLAGS "${WARN_FLAGS} -Wno-conversion")
   if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL "13.0")
     set (WARN_FLAGS "${WARN_FLAGS} -Wno-reserved-identifier")
   endif()
@@ -140,7 +144,7 @@ elseif (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
   else ()
     set (CXX17_FLAGS "-std=c++17")
   endif ()
-  set (WARN_FLAGS "-Wall -Wno-unknown-pragmas -Wconversion")
+  set (WARN_FLAGS "-Wall -Wno-unknown-pragmas -Wno-conversion")
   if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER "12.0")
     set (WARN_FLAGS "${WARN_FLAGS} -Wno-array-bounds -Wno-stringop-overread") # To compile protobuf in Fedora37
   endif ()
@@ -174,6 +178,15 @@ enable_testing()
 
 INCLUDE(CheckSourceCompiles)
 INCLUDE(ThirdpartyToolchain)
+message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+#
+# macOS doesn't fully support AVX512; it handles AVX512 differently from Windows and Linux.
+#
+# A description can be found here:
+# https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/fpu.c#L174
+if (BUILD_ENABLE_AVX512 AND NOT APPLE)
+  INCLUDE(ConfigSimdLevel)
+endif ()
 
 set (EXAMPLE_DIRECTORY ${CMAKE_SOURCE_DIR}/examples)
 

diff --git a/README.md b/README.md
@@ -93,3 +93,18 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabled:
+```shell
+% export ORC_USER_SIMD_LEVEL=AVX512
+% mkdir build
+% cd build
+% cmake .. -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON
+% make package
+% make test-out
+```
+The CMake option BUILD_ENABLE_AVX512 can be set to "ON" or "OFF" (the default) at compile time. It determines whether the AVX512 SIMD code is compiled into the binaries.
+
+The environment variable ORC_USER_SIMD_LEVEL can be set to "AVX512" or "NONE" (the default) at run time. It selects the SIMD level used when dispatching code that has a SIMD-optimized implementation.
+
+Note that if ORC_USER_SIMD_LEVEL is set to "NONE" at run time, AVX512 is not used even if BUILD_ENABLE_AVX512 was set to "ON" at compile time.
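
The interaction between the compile-time option and the run-time environment variable described in the README text can be sketched in C++ as follows. This is an illustration under stated assumptions, not the dispatch code in this commit: `SimdLevel`, `parseSimdLevel`, `simdLevelFromEnv`, and `avx512Enabled` are hypothetical names, and treating an unset or unrecognized value as "NONE" is an assumption based on the default stated in the README text. Only the macro ORC_HAVE_RUNTIME_AVX512 and the variable ORC_USER_SIMD_LEVEL come from the commit itself.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

enum class SimdLevel { None, Avx512 };

// True only when the build defined the macro named in the commit message.
constexpr bool kCompiledWithAvx512 =
#ifdef ORC_HAVE_RUNTIME_AVX512
    true;
#else
    false;
#endif

// Hypothetical parser: case-insensitive; anything other than "AVX512"
// (including a missing value) is treated as None. That choice is an assumption.
SimdLevel parseSimdLevel(const char* raw) {
  if (raw == nullptr) return SimdLevel::None;
  std::string value(raw);
  std::transform(value.begin(), value.end(), value.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return value == "AVX512" ? SimdLevel::Avx512 : SimdLevel::None;
}

// Read the user's requested level from the environment.
SimdLevel simdLevelFromEnv() {
  return parseSimdLevel(std::getenv("ORC_USER_SIMD_LEVEL"));
}

// AVX512 is used only if it was compiled in AND requested at run time.
bool avx512Enabled(SimdLevel userLevel, bool compiledIn = kCompiledWithAvx512) {
  return compiledIn && userLevel == SimdLevel::Avx512;
}
```

A caller would evaluate `avx512Enabled(simdLevelFromEnv())` once at startup. A real dispatcher would also need to confirm that the CPU supports AVX-512, which this sketch leaves out.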
diff --git a/c++/src/BitUnpackerAvx512.hh b/c++/src/BitUnpackerAvx512.hh
diff --git a/c++/src/Bpacking.hh b/c++/src/Bpacking.hh
@@ -0,0 +1,34 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <cstdint>
+
+namespace orc {
+  class RleDecoderV2;
+
+  class BitUnpack {
+   public:
+    static void readLongs(RleDecoderV2* decoder, int64_t* data, uint64_t offset, uint64_t len,
+                          uint64_t fbs);
+  };
+}  // namespace orc
+
+#endif
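
The header above exposes a single static entry point, BitUnpack::readLongs. One plausible way for such an entry point to choose between a vector and a scalar implementation is to probe the CPU once and cache the answer. The sketch below is an assumption and not the code in CpuInfoUtil.cc: `cpuHasAvx512` and `selectedUnpackerName` are hypothetical names, the probe relies on the GCC/Clang builtin `__builtin_cpu_supports`, and it checks only the AVX-512 Foundation subset, whereas the real decoder may require further subsets. On other compilers or architectures it reports no AVX-512 support.

```cpp
#include <string>

// Hypothetical probe; ORC's CpuInfoUtil.cc may detect CPU features differently.
bool cpuHasAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();  // safe to call more than once
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;  // unknown compiler or non-x86 target: stay on the scalar path
#endif
}

// Probe once and cache the decision, so later calls pay no detection cost.
const std::string& selectedUnpackerName() {
  static const std::string name = cpuHasAvx512() ? "avx512" : "scalar";
  return name;
}
```

Caching the decision in a function-local static keeps the per-call cost of dispatch to a single lookup, which matters for a routine called once per block of values.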