cpu: ppc64: add gemm and reorder kernels #3156
Conversation
For the issues with our CI:
- You need to change the commit message to comply with `<scope>:[ <scope>:] <description>`, e.g. in your case: `cpu: ppc64: add gemm and reorder kernels`.
- Format using clang-format; you can see how it is used in `.github/automation/clang-format.sh`. Required changes will be applied in place (the `-i` in the command), so you do not have to manually fix everything.
- Header guards I have marked.
*Force-pushed from 1a73179 to 994481a.*
*Force-pushed from 4d3bfdd to 75ba7ff.*
src/CMakeLists.txt (outdated)

```diff
 if(DNNL_EXPERIMENTAL_UKERNEL)
-    if(DNNL_TARGET_ARCH STREQUAL "X64" OR DNNL_TARGET_ARCH STREQUAL "AARCH64")
+    if(DNNL_TARGET_ARCH STREQUAL "X64" OR DNNL_TARGET_ARCH STREQUAL "AARCH64" OR DNNL_TARGET_ARCH STREQUAL "PPC64")
```
Is the experimental uKernel API used somewhere in this patch? Or do you want to add it for integration with other libraries?
Without this change I was not able to build oneDNN on a Power system, so I added this line here.
I was running oneDNN through vLLM and PyTorch but was not able to build it; with this change I was able to build both PyTorch and vLLM.
We want to use the brgemm approach for the next phase, and I have gone through the oneDNN documentation regarding that, but the information I found is for JIT-generated code. We do not currently have an xbyak-like JIT assembler for Power, so can we use static C++ routines with Power vector intrinsics instead?
Unresolving this one.
It seems that the change must be done on the PyTorch side instead: if the target platform is PPC64, DNNL_EXPERIMENTAL_UKERNEL must be turned off on their side when enabling oneDNN.
This macro is responsible for enabling the API and might otherwise lead to misleading expectations.
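For illustration, the integrator-side guard being suggested could look like the following hypothetical CMake fragment. The option name comes from this thread; the processor check and surrounding logic are assumptions about the integrator's build system, not actual PyTorch code.

```cmake
# Illustrative: disable the experimental uKernel API when building
# oneDNN for PPC64, where that API is not implemented.
if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(ppc64|ppc64le)$")
  set(DNNL_EXPERIMENTAL_UKERNEL OFF CACHE BOOL "" FORCE)
endif()
```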
*Force-pushed from 75ba7ff to 7b2f1a8.*
Hi @spalicki, as you asked, I have collected performance numbers with oneDNN benchdnn (`./matmul_perf_cpp`), with and without the changes. Other benchdnn cases I ran, with and without the PR changes:

./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 8192x8192:8192x8192
./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 4096x4096:4096x4096

Looking at the execution time of the gemm kernel in both cases, it is 1.9x faster than the earlier oneDNN. I have also run through PyTorch and vLLM, and it gives a performance boost there as well. Please let me know if anything more is required.
@Tiwari-Avanish Those 2 cases are not very representative of overall DL performance. We have prepared some batch files with cases extracted directly from some models; those would be a better way of measuring the performance difference. If you take a look at the [...] you can run them as e.g. [...] This will give you CSV files that you can load into a spreadsheet or pandas and then easily compare with each other. This script can be compressed to several lines using tools like [...]. Also, if you add [...]
Thanks @spalicki for giving me the steps to collect the perf data. This is my first time working with oneDNN, so it is taking some time.
*Force-pushed from 7b2f1a8 to d263cbb.*
Hi @spalicki, I have run oneDNN as you mentioned above. With the -DDNNL_TEST_SET=NIGHTLY argument, 6-7 test cases were failing, but that is now fixed and I have pushed those changes as well. Tests log: output logs of [...]
| Name | Original GFLOPs | Modified GFLOPs | Improvement (x) | Improvement (%) |
|---|---|---|---|---|
| GNMT:0*1 | 164.214 | 384.374 | 2.34x | 134.07% |
| GNMT:1*1 | 265.652 | 894.157 | 3.37x | 236.62% |
| WnD-512:0*1 | 772.359 | 1346.7 | 1.74x | 74.36% |
| WnD-512:1*1 | 784.183 | 1187.7 | 1.51x | 51.46% |
| WnD-512:2*1 | 453.494 | 760.867 | 1.68x | 67.78% |
| resnet:ip1*1 | 90.1254 | 1724 | 19.13x | 1812.98% |
| resnet_sparse:ip1*1 | 110.76 | 1480.19 | 13.36x | 1236.62% |
| googlenet_v1:ip1*1 | 92.1656 | 1412.88 | 15.33x | 1433.02% |
| googlenet_v1:ip2*1 | 261.521 | 1333.1 | 5.10x | 409.75% |
| inceptionv3:ip1*1 | 139.086 | 1977.31 | 14.22x | 1321.61% |
| VGG16:ip1*1 | 52.6008 | 4053.91 | 77.07x | 7606.19% |
| VGG16:ip2*1 | 65.5996 | 2542.89 | 38.76x | 3776.38% |
| VGG16:ip3*1 | 47.3232 | 394.87 | 8.34x | 734.36% |
| VGG16:ip4*1 | 121.101 | 683.832 | 5.65x | 464.76% |
| NCF:0*1 | 661.168 | 717.718 | 1.09x | 8.55% |
| NCF:1*1 | 558.028 | 642.415 | 1.15x | 15.12% |
| NCF:2*1 | 256.874 | 283.381 | 1.10x | 10.32% |
| NCF:3*1 | 18.2597 | 11.2339 | 0.62x | -38.48% |
| Alexnet:ip1*1 | 320.218 | 3628.01 | 11.33x | 1033.08% |
| Alexnet:ip2*1 | 351.288 | 3128.06 | 8.91x | 790.68% |
| Alexnet:ip3*1 | 443.303 | 3080.25 | 6.95x | 594.84% |
| masknet:ip1*1 | 522.195 | 2922.14 | 5.60x | 459.59% |
| masknet:ip2*1 | 1056.25 | 1817.74 | 1.72x | 72.11% |
| masknet:ip3*1 | 1539.48 | 1495.31 | 0.97x | -2.87% |
| masknet:ip4*1 | 420.456 | 1130.4 | 2.69x | 168.85% |
| RNN-T:Encoder_cell1_Input*2 | 499.792 | 741.557 | 1.48x | 48.37% |
| RNN-T:Encoder_cell1_Hidden*11 | 357.773 | 2010.39 | 5.62x | 462.03% |
| RNN-T:Encoder_cell3_Input*1 | 335.125 | 2653.63 | 7.92x | 692.06% |
| RNN-T:Prediction_Input*12 | 859.962 | 891.03 | 1.04x | 3.61% |
| RNN-T:JointNet_Linear1*3 | 1223.47 | 1658.04 | 1.36x | 35.52% |
| RNN-T:JointNet_Linear2*3 | 122.171 | 395.666 | 3.24x | 223.86% |
| DLRM:0*1 | 43.4487 | 43.4487 | 1.00x | 0.00% |
| DLRM:1*2 | 1048.49 | 1223.28 | 1.17x | 16.68% |
| DLRM:2*1 | 557.495 | 641.958 | 1.15x | 15.15% |
| DLRM:3*1 | 1221.58 | 1232.88 | 1.01x | 0.92% |
| DLRM:4*1 | 1308.72 | 2032.19 | 1.55x | 55.30% |
| DLRM:5*1 | 1795.56 | 1938.74 | 1.08x | 7.97% |
| DLRM:7*1 | 30.5793 | 21.7662 | 0.71x | -28.82% |
| BERT:MM_5*24 | 1024.66 | 919.494 | 0.90x | -10.26% |
| Transformer_lt:Encoder_MM_5*6 | 131.099 | 123.769 | 0.94x | -5.59% |
| Transformer_lt:Decoder_MM_5*240 | 84.6122 | 88.1706 | 1.04x | 4.21% |
| Transformer_lt:Decoder_MM_yy20*240 | 20.7003 | 22.0387 | 1.06x | 6.46% |
I have run others too; those also show a performance benefit.
perf_matmul_training
Detailed report of perf matmul training:
| Operation Name | Original GFLOPs | New GFLOPs | Improvement (%) |
|---|---|---|---|
| GNMT_train:FWD,0*1 | 255.796 | 577.93 | 125.93% |
| GNMT_train:FWD,1*1 | 213.356 | 1181.14 | 453.59% |
| GNMT_train:BWD_D,0*1 | 315.642 | 590.895 | 87.21% |
| GNMT_train:BWD_D,1*1 | 273.425 | 1191.61 | 335.81% |
| GNMT_train:BWD_W,0*1 | 292.157 | 335.077 | 14.69% |
| GNMT_train:BWD_W,1*1 | 384.289 | 411.557 | 7.10% |
| WnD-40_train:FWD,0*1 | 436.866 | 758.85 | 73.70% |
| WnD-40_train:FWD,1*1 | 406.974 | 532.134 | 30.75% |
| WnD-40_train:FWD,2*1 | 229.095 | 251.684 | 9.86% |
| WnD-40_train:BWD_D,0*1 | 272.789 | 812.602 | 197.88% |
| WnD-40_train:BWD_D,1*1 | 418.013 | 599.863 | 43.50% |
| WnD-40_train:BWD_D,2*1 | 225.569 | 226.982 | 0.63% |
| WnD-40_train:BWD_W,0*1 | 134.92 | 129.459 | -4.05% |
| WnD-40_train:BWD_W,1*1 | 126.821 | 126.808 | -0.01% |
| WnD-40_train:BWD_W,2*1 | 93.1004 | 101.648 | 9.18% |
| resnet_train:FWD,ip1*1 | 87.9965 | 1881.27 | 2037.99% |
| resnet_sparse_train:FWD,ip1*1 | 111.89 | 1637.81 | 1363.88% |
| resnet_train:BWD_D,ip1*1 | 107.769 | 1279.43 | 1087.17% |
| resnet_sparse_train:BWD_D,ip1*1 | 126.008 | 1170.93 | 829.25% |
| resnet_train:BWD_W,ip1*1 | 382.386 | 388.569 | 1.62% |
| resnet_sparse_train:BWD_W,ip1*1 | 225.692 | 225.923 | 0.10% |
| googlenet_v1_train:FWD,ip1*1 | 90.5023 | 1493.63 | 1550.38% |
| googlenet_v1_train:FWD,ip2*1 | 155.082 | 1451.61 | 836.13% |
| inceptionv3_train:FWD,ip1*1 | 135.547 | 2019.27 | 1389.70% |
| googlenet_v1_train:BWD_D,ip1*1 | 109.674 | 1331.91 | 1114.51% |
| googlenet_v1_train:BWD_D,ip2*1 | 220.256 | 1174.7 | 433.25% |
| inceptionv3_train:BWD_D,ip1*1 | 146.799 | 1407.31 | 858.80% |
| googlenet_v1_train:BWD_W,ip1*1 | 426.755 | 421.418 | -1.25% |
| googlenet_v1_train:BWD_W,ip2*1 | 395.216 | 424.881 | 7.50% |
| inceptionv3_train:BWD_W,ip1*1 | 709.11 | 716.597 | 1.06% |
| VGG16_train:FWD,ip1*1 | 52.7683 | 4059.26 | 7592.56% |
| VGG16_train:FWD,ip2*1 | 67.6152 | 2540.64 | 3656.93% |
| VGG16_train:FWD,ip3*1 | 48.1058 | 458.098 | 852.28% |
| VGG16_train:FWD,ip4*1 | 122.528 | 713.05 | 482.10% |
| VGG16_train:BWD_D,ip1*1 | 78.8694 | 3049.83 | 3766.55% |
| VGG16_train:BWD_D,ip2*1 | 80.3912 | 2715.68 | 3277.72% |
| VGG16_train:BWD_D,ip3*1 | 144.421 | 242.304 | 67.78% |
| VGG16_train:BWD_D,ip4*1 | 262.371 | 790.497 | 201.29% |
| VGG16_train:BWD_W,ip1*1 | 197.683 | 215.704 | 9.12% |
| VGG16_train:BWD_W,ip2*1 | 218.549 | 223.991 | 2.49% |
| VGG16_train:BWD_W,ip3*1 | 187.881 | 197.001 | 4.85% |
| VGG16_train:BWD_W,ip4*1 | 221.383 | 222.324 | 0.42% |
| NCF_train:FWD,0*1 | 659.18 | 747.211 | 13.35% |
| NCF_train:FWD,1*1 | 556.694 | 684.544 | 22.97% |
| NCF_train:FWD,2*1 | 256.06 | 330.387 | 29.03% |
| NCF_train:FWD,3*1 | 18.1384 | 19.466 | 7.32% |
| NCF_train:BWD_D,0*1 | 704.312 | 750.246 | 6.52% |
| NCF_train:BWD_D,1*1 | 388.405 | 401.794 | 3.45% |
| NCF_train:BWD_D,2*1 | 188.175 | 191.035 | 1.52% |
| NCF_train:BWD_D,3*1 | 3.19066 | 3.15545 | -1.10% |
| NCF_train:BWD_W,0*1 | 293.183 | 906.591 | 209.22% |
| NCF_train:BWD_W,1*1 | 170.72 | 648.933 | 280.11% |
| NCF_train:BWD_W,2*1 | 49.5902 | 444.603 | 796.56% |
| NCF_train:BWD_W,3*1 | 3.79355 | 5.4827 | 44.53% |
Please let me know if you need anything else. I ran these tests with numactl, with the NUMA node aligned on my system.
Hi @spalicki
*Force-pushed from b852e19 to 4d6be89.*
Hi @spalicki
Hi @spalicki, whenever you get time, please take a look at this PR.
Hi @spalicki, thanks for approving these changes. The PR needs at least two reviewers to approve before it can be merged into the main branch. Could you please suggest or add an additional reviewer to review and approve the changes?
Hi @spalicki, could you please add or suggest somebody to review this PR? I will add them to it.
src/common/utils.hpp (outdated)

```cpp
#else
    return false;
#endif
}
```
cpu/platform.hpp might be a better place for such a function.
Thanks @dzarukin, I have moved this method into platform.hpp.
Thanks for suggesting that DNNL_EXPERIMENTAL_UKERNEL be handled on the PyTorch side; I have removed the PPC check for it from CMakeLists.txt.
```cpp
DNNL_AARCH64_ONLY(CPU_REORDER_INSTANCE(aarch64::jit_uni_reorder_t))
// ...
REG_FAST_DIRECT_COPY(f32, u8)
DNNL_PPC64_ONLY(CPU_REORDER_INSTANCE(ppc64::ppc64_matrixA_reorder_t))
```
If you target the classic quantization case, it may happen that the user already has quantized data (or data in an x8 data type), and this reorder should then support a format change within the x8 data type, since it is a packing routine.
This is not blocking the PR; it is just for your information that an additional extension might be required for PyTorch integration.
Yes, you are right about that. When I run through vLLM with the oneDNN backend, there is no reorder, because the fp32-to-int8 conversion happens in vLLM itself.
But for PyTorch dynamic quantization, I observed that for the input tensor it calls a reorder from fp32 to uint8 in oneDNN. That is why I wrote one for Power, and it gave some performance benefit.
*Force-pushed from 4d6be89 to 9334ebc.*
Thanks @dzarukin for reviewing this; I have made the changes you asked for. Please review them, and let me know if any further changes are required.
Hi @dzarukin, I have made the changes you asked for. Could you please look at them, based on your previous review?
Thank you for waiting, and sorry it took longer to approve.
This PR was reverted as it breaks compatibility with PowerPC systems without |
Description
- Implemented a reorder from fp32 to u8 for matmul.
- Implemented a prepacking routine for input and output that also supports u8 to s8 conversion, based on the input data type.
- Implemented a gemm driver that runs the gemm kernel in parallel, based on the block sizes of matrices A and B.
- Improved the performance of the gemm kernel.