cpu: ppc64: add gemm and reorder kernels #3156
Conversation
For the issues with our CI:
- You need to change the commit message to comply with `<scope>:[ <scope>:] <description>`, e.g. in your case: `cpu: ppc64: add gemm and reorder kernels`.
- Format using clang-format; you can see how it is used in `.github/automation/clang-format.sh`. Required changes will be applied in place (the `-i` in the command), so you do not have to manually fix everything.
- Header guards I have marked.
*Force-pushed from 1a73179 to 994481a.*
*Force-pushed from 4d3bfdd to 75ba7ff.*
src/CMakeLists.txt (outdated)

```diff
 if(DNNL_EXPERIMENTAL_UKERNEL)
-    if(DNNL_TARGET_ARCH STREQUAL "X64" OR DNNL_TARGET_ARCH STREQUAL "AARCH64")
+    if(DNNL_TARGET_ARCH STREQUAL "X64" OR DNNL_TARGET_ARCH STREQUAL "AARCH64" OR DNNL_TARGET_ARCH STREQUAL "PPC64")
```
Is the experimental uKernel API used somewhere in this patch? Or do you want to add it for integration with other libraries?
Without this change I was not able to build oneDNN on a Power system, so I added this line here.
I was running oneDNN through vLLM and PyTorch but was not able to build it; with this change I was able to build both PyTorch and vLLM.
We want to use the brgemm approach for the next phase, and I have gone through the oneDNN documentation regarding that, but the information I found is for JIT-generated code. We do not currently have an xbyak-like JIT assembler for Power, so can we use static C++ routines with Power vector intrinsics instead?
Unresolving this one.
It seems that the change must be done on the PyTorch side instead: if the target platform is PPC64, DNNL_EXPERIMENTAL_UKERNEL must be turned off on their side when enabling oneDNN.
This macro is responsible for enabling the API and might otherwise lead to misleading expectations.
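For illustration, the integrator-side guard being suggested could look like the following hypothetical CMake fragment. The option name comes from this thread; the processor check and surrounding logic are assumptions about the integrator's build system, not actual PyTorch code.

```cmake
# Illustrative: disable the experimental uKernel API when building
# oneDNN for PPC64, where that API is not implemented.
if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(ppc64|ppc64le)$")
  set(DNNL_EXPERIMENTAL_UKERNEL OFF CACHE BOOL "" FORCE)
endif()
```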
*Force-pushed from 75ba7ff to 7b2f1a8.*
Hi @spalicki, as you asked, I have collected performance numbers with oneDNN benchdnn (`./matmul_perf_cpp`), with and without the changes. Other benchdnn cases I ran, with and without the PR changes:

./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 8192x8192:8192x8192
./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 4096x4096:4096x4096

Looking at the execution time of the gemm kernel in both cases, it is 1.9x faster than the earlier oneDNN. I have also run through PyTorch and vLLM, and it gives a performance boost there as well. Please let me know if anything more is required.
@Tiwari-Avanish Those 2 cases are not very representative of overall DL performance. We have prepared some batch files with cases extracted directly from some models; those would be a better way of measuring the performance difference. If you take a look at the [...] you can run them as e.g. [...] This will give you CSV files that you can load into a spreadsheet or pandas and then easily compare with each other. This script can be compressed to several lines using tools like [...]. Also, if you add [...]
Thanks @spalicki for giving me the steps to collect the perf data. This is my first time working with oneDNN, so it is taking some time.
*Force-pushed from 7b2f1a8 to d263cbb.*
Hi @spalicki, I have run oneDNN as you mentioned above. With the -DDNNL_TEST_SET=NIGHTLY argument, 6-7 test cases were failing, but that is now fixed and I have pushed those changes as well. Tests log: output logs of [...]
| Name | Original GFLOPs | Modified GFLOPs | Improvement (x) | Improvement (%) |
|---|---|---|---|---|
| GNMT:0*1 | 164.214 | 384.374 | 2.34x | 134.07% |
| GNMT:1*1 | 265.652 | 894.157 | 3.37x | 236.62% |
| WnD-512:0*1 | 772.359 | 1346.7 | 1.74x | 74.36% |
| WnD-512:1*1 | 784.183 | 1187.7 | 1.51x | 51.46% |
| WnD-512:2*1 | 453.494 | 760.867 | 1.68x | 67.78% |
| resnet:ip1*1 | 90.1254 | 1724 | 19.13x | 1812.98% |
| resnet_sparse:ip1*1 | 110.76 | 1480.19 | 13.36x | 1236.62% |
| googlenet_v1:ip1*1 | 92.1656 | 1412.88 | 15.33x | 1433.02% |
| googlenet_v1:ip2*1 | 261.521 | 1333.1 | 5.10x | 409.75% |
| inceptionv3:ip1*1 | 139.086 | 1977.31 | 14.22x | 1321.61% |
| VGG16:ip1*1 | 52.6008 | 4053.91 | 77.07x | 7606.19% |
| VGG16:ip2*1 | 65.5996 | 2542.89 | 38.76x | 3776.38% |
| VGG16:ip3*1 | 47.3232 | 394.87 | 8.34x | 734.36% |
| VGG16:ip4*1 | 121.101 | 683.832 | 5.65x | 464.76% |
| NCF:0*1 | 661.168 | 717.718 | 1.09x | 8.55% |
| NCF:1*1 | 558.028 | 642.415 | 1.15x | 15.12% |
| NCF:2*1 | 256.874 | 283.381 | 1.10x | 10.32% |
| NCF:3*1 | 18.2597 | 11.2339 | 0.62x | -38.48% |
| Alexnet:ip1*1 | 320.218 | 3628.01 | 11.33x | 1033.08% |
| Alexnet:ip2*1 | 351.288 | 3128.06 | 8.91x | 790.68% |
| Alexnet:ip3*1 | 443.303 | 3080.25 | 6.95x | 594.84% |
| masknet:ip1*1 | 522.195 | 2922.14 | 5.60x | 459.59% |
| masknet:ip2*1 | 1056.25 | 1817.74 | 1.72x | 72.11% |
| masknet:ip3*1 | 1539.48 | 1495.31 | 0.97x | -2.87% |
| masknet:ip4*1 | 420.456 | 1130.4 | 2.69x | 168.85% |
| RNN-T:Encoder_cell1_Input*2 | 499.792 | 741.557 | 1.48x | 48.37% |
| RNN-T:Encoder_cell1_Hidden*11 | 357.773 | 2010.39 | 5.62x | 462.03% |
| RNN-T:Encoder_cell3_Input*1 | 335.125 | 2653.63 | 7.92x | 692.06% |
| RNN-T:Prediction_Input*12 | 859.962 | 891.03 | 1.04x | 3.61% |
| RNN-T:JointNet_Linear1*3 | 1223.47 | 1658.04 | 1.36x | 35.52% |
| RNN-T:JointNet_Linear2*3 | 122.171 | 395.666 | 3.24x | 223.86% |
| DLRM:0*1 | 43.4487 | 43.4487 | 1.00x | 0.00% |
| DLRM:1*2 | 1048.49 | 1223.28 | 1.17x | 16.68% |
| DLRM:2*1 | 557.495 | 641.958 | 1.15x | 15.15% |
| DLRM:3*1 | 1221.58 | 1232.88 | 1.01x | 0.92% |
| DLRM:4*1 | 1308.72 | 2032.19 | 1.55x | 55.30% |
| DLRM:5*1 | 1795.56 | 1938.74 | 1.08x | 7.97% |
| DLRM:7*1 | 30.5793 | 21.7662 | 0.71x | -28.82% |
| BERT:MM_5*24 | 1024.66 | 919.494 | 0.90x | -10.26% |
| Transformer_lt:Encoder_MM_5*6 | 131.099 | 123.769 | 0.94x | -5.59% |
| Transformer_lt:Decoder_MM_5*240 | 84.6122 | 88.1706 | 1.04x | 4.21% |
| Transformer_lt:Decoder_MM_yy20*240 | 20.7003 | 22.0387 | 1.06x | 6.46% |
I have run others too; those also show a performance benefit.
perf_matmul_training
Detailed report of perf matmul training:
| Operation Name | Original GFLOPs | New GFLOPs | Improvement (%) |
|---|---|---|---|
| GNMT_train:FWD,0*1 | 255.796 | 577.93 | 125.93% |
| GNMT_train:FWD,1*1 | 213.356 | 1181.14 | 453.59% |
| GNMT_train:BWD_D,0*1 | 315.642 | 590.895 | 87.21% |
| GNMT_train:BWD_D,1*1 | 273.425 | 1191.61 | 335.81% |
| GNMT_train:BWD_W,0*1 | 292.157 | 335.077 | 14.69% |
| GNMT_train:BWD_W,1*1 | 384.289 | 411.557 | 7.10% |
| WnD-40_train:FWD,0*1 | 436.866 | 758.85 | 73.70% |
| WnD-40_train:FWD,1*1 | 406.974 | 532.134 | 30.75% |
| WnD-40_train:FWD,2*1 | 229.095 | 251.684 | 9.86% |
| WnD-40_train:BWD_D,0*1 | 272.789 | 812.602 | 197.88% |
| WnD-40_train:BWD_D,1*1 | 418.013 | 599.863 | 43.50% |
| WnD-40_train:BWD_D,2*1 | 225.569 | 226.982 | 0.63% |
| WnD-40_train:BWD_W,0*1 | 134.92 | 129.459 | -4.05% |
| WnD-40_train:BWD_W,1*1 | 126.821 | 126.808 | -0.01% |
| WnD-40_train:BWD_W,2*1 | 93.1004 | 101.648 | 9.18% |
| resnet_train:FWD,ip1*1 | 87.9965 | 1881.27 | 2037.99% |
| resnet_sparse_train:FWD,ip1*1 | 111.89 | 1637.81 | 1363.88% |
| resnet_train:BWD_D,ip1*1 | 107.769 | 1279.43 | 1087.17% |
| resnet_sparse_train:BWD_D,ip1*1 | 126.008 | 1170.93 | 829.25% |
| resnet_train:BWD_W,ip1*1 | 382.386 | 388.569 | 1.62% |
| resnet_sparse_train:BWD_W,ip1*1 | 225.692 | 225.923 | 0.10% |
| googlenet_v1_train:FWD,ip1*1 | 90.5023 | 1493.63 | 1550.38% |
| googlenet_v1_train:FWD,ip2*1 | 155.082 | 1451.61 | 836.13% |
| inceptionv3_train:FWD,ip1*1 | 135.547 | 2019.27 | 1389.70% |
| googlenet_v1_train:BWD_D,ip1*1 | 109.674 | 1331.91 | 1114.51% |
| googlenet_v1_train:BWD_D,ip2*1 | 220.256 | 1174.7 | 433.25% |
| inceptionv3_train:BWD_D,ip1*1 | 146.799 | 1407.31 | 858.80% |
| googlenet_v1_train:BWD_W,ip1*1 | 426.755 | 421.418 | -1.25% |
| googlenet_v1_train:BWD_W,ip2*1 | 395.216 | 424.881 | 7.50% |
| inceptionv3_train:BWD_W,ip1*1 | 709.11 | 716.597 | 1.06% |
| VGG16_train:FWD,ip1*1 | 52.7683 | 4059.26 | 7592.56% |
| VGG16_train:FWD,ip2*1 | 67.6152 | 2540.64 | 3656.93% |
| VGG16_train:FWD,ip3*1 | 48.1058 | 458.098 | 852.28% |
| VGG16_train:FWD,ip4*1 | 122.528 | 713.05 | 482.10% |
| VGG16_train:BWD_D,ip1*1 | 78.8694 | 3049.83 | 3766.55% |
| VGG16_train:BWD_D,ip2*1 | 80.3912 | 2715.68 | 3277.72% |
| VGG16_train:BWD_D,ip3*1 | 144.421 | 242.304 | 67.78% |
| VGG16_train:BWD_D,ip4*1 | 262.371 | 790.497 | 201.29% |
| VGG16_train:BWD_W,ip1*1 | 197.683 | 215.704 | 9.12% |
| VGG16_train:BWD_W,ip2*1 | 218.549 | 223.991 | 2.49% |
| VGG16_train:BWD_W,ip3*1 | 187.881 | 197.001 | 4.85% |
| VGG16_train:BWD_W,ip4*1 | 221.383 | 222.324 | 0.42% |
| NCF_train:FWD,0*1 | 659.18 | 747.211 | 13.35% |
| NCF_train:FWD,1*1 | 556.694 | 684.544 | 22.97% |
| NCF_train:FWD,2*1 | 256.06 | 330.387 | 29.03% |
| NCF_train:FWD,3*1 | 18.1384 | 19.466 | 7.32% |
| NCF_train:BWD_D,0*1 | 704.312 | 750.246 | 6.52% |
| NCF_train:BWD_D,1*1 | 388.405 | 401.794 | 3.45% |
| NCF_train:BWD_D,2*1 | 188.175 | 191.035 | 1.52% |
| NCF_train:BWD_D,3*1 | 3.19066 | 3.15545 | -1.10% |
| NCF_train:BWD_W,0*1 | 293.183 | 906.591 | 209.22% |
| NCF_train:BWD_W,1*1 | 170.72 | 648.933 | 280.11% |
| NCF_train:BWD_W,2*1 | 49.5902 | 444.603 | 796.56% |
| NCF_train:BWD_W,3*1 | 3.79355 | 5.4827 | 44.53% |
Please let me know if you need anything else. I ran these tests with numactl, with the NUMA node aligned on my system.
Hi @spalicki
*Force-pushed from b852e19 to 4d6be89.*
Hi @spalicki
Hi @spalicki, whenever you get time, please take a look at this PR.
Hi @spalicki, thanks for approving these changes. The PR needs at least two reviewers to approve before it can be merged into the main branch. Could you please suggest or add an additional reviewer to review and approve the changes?
Hi @spalicki, could you please add or suggest somebody to review this PR? I will add them to it.
src/common/utils.hpp (outdated)

```cpp
#else
    return false;
#endif
}
```
cpu/platform.hpp might be a better place for such a function.
Thanks @dzarukin, I have moved this method into platform.hpp.
Thanks for suggesting that DNNL_EXPERIMENTAL_UKERNEL be handled on the PyTorch side; I have removed the PPC check for it from CMakeLists.txt.
```cpp
DNNL_AARCH64_ONLY(CPU_REORDER_INSTANCE(aarch64::jit_uni_reorder_t))
// ...
REG_FAST_DIRECT_COPY(f32, u8)
DNNL_PPC64_ONLY(CPU_REORDER_INSTANCE(ppc64::ppc64_matrixA_reorder_t))
```
If you target the classic quantization case, it may happen that the user already has quantized data (or data in an x8 data type), and this reorder should then support a format change within the x8 data type, since it is a packing routine.
This is not blocking the PR; it is just for your information that an additional extension might be required for PyTorch integration.
Yes, you are right about that. When I run through vLLM with the oneDNN backend, there is no reorder, because the fp32-to-int8 conversion happens in vLLM itself.
But for PyTorch dynamic quantization, I observed that for the input tensor it calls a reorder from fp32 to uint8 in oneDNN. That is why I wrote one for Power, and it gave some performance benefit.
*Force-pushed from 4d6be89 to 9334ebc.*
Thanks @dzarukin for reviewing this; I have made the changes you asked for. Please review them, and let me know if any further changes are required.
Hi @dzarukin, I have made the changes you asked for. Could you please look at them, based on your previous review?
Thank you for waiting, and sorry it took longer to approve.
This PR was reverted as it breaks compatibility with PowerPC systems without |
Description
- Implemented a reorder from fp32 to u8 for matmul.
- Implemented a prepacking routine for input and output that also supports u8 to s8 conversion, based on the input data type.
- Implemented a gemm driver that runs the gemm kernel in parallel, based on the block sizes of matrices A and B.
- Improved the performance of the gemm kernel.