Skip to content

cpu:ppc64: fix GEMM reorder build issue on Power system#4002

Merged
vpirogov merged 3 commits intouxlfoundation:mainfrom
Tiwari-Avanish:ppc64-gemm-reorder-fix
Oct 3, 2025
Merged

cpu:ppc64: fix GEMM reorder build issue on Power system#4002
vpirogov merged 3 commits intouxlfoundation:mainfrom
Tiwari-Avanish:ppc64-gemm-reorder-fix

Conversation

@Tiwari-Avanish
Copy link
Contributor

My previous changes #3156 got reverted back because of build issue on different power system.

This PR addresses the build issues for the GEMM reorder kernels on ppc64 architectures
including Power8, Power9, and Power10.

The changes include:

  • Fix alignment and vectorization issues in gemm_driver.cpp and ppc64_gemm_reorder.cpp
  • Update reorder logic in cpu_reorder_regular_f32_u8.cpp
  • Ensure compatibility with all supported Power architectures

Test Results:

Power10:

Build successfully:
[100%] Linking CXX executable benchdnn
[100%] Linking CXX executable test_binary
[100%] Built target test_binary
[100%] Linking CXX executable test_internals
[100%] Built target test_internals
[100%] Built target test_graph_unit
[100%] Built target benchdnn

Testcase:
338/338 Test #338: noexcept-cpp ............................................ Passed    0.04 sec
100% tests passed, 0 tests failed out of 338

Power 9:

Build successfully:


[100%] Built target test_internals
[100%] Linking CXX executable test_convolution_forward_f32
[100%] Built target test_convolution_forward_f32
[100%] Linking CXX executable test_binary
[100%] Linking CXX executable test_rnn_forward
[100%] Built target test_binary
[100%] Built target test_rnn_forward
[100%] Linking CXX executable benchdnn
[100%] Built target benchdnn
[100%] Linking CXX executable test_reorder
[100%] Built target test_reorder


TestCase:

338/338 Test #338: noexcept-cpp ............................................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 338

Power8:

Build successfully:
[100%] Built target test_gemm_u8s8s32
[100%] Linking CXX executable test_internals
[100%] Built target test_rnn_forward
[100%] Built target test_graph_cpp_api_compile
[100%] Built target test_internals
[100%] Built target test_graph_unit
[100%] Linking CXX executable test_sum
[100%] Built target test_sum
[100%] Linking CXX executable benchdnn
[100%] Built target benchdnn

Testcase:
338/338 Test #338: noexcept-cpp ............................................ Passed    0.06 sec
100% tests passed, 0 tests failed out of 338

Related PRs:

Thanks to @vpirogov and @spalicki for helping identify and preserve the relevant changes.

Please review this PR. All changes have been tested locally across Power8, Power9, and Power10.

@Tiwari-Avanish Tiwari-Avanish requested a review from a team as a code owner September 25, 2025 04:35
@Tiwari-Avanish Tiwari-Avanish changed the title cpu:ppc64: fix GEMM reorder build issues on Power8, Power9 and Power10 cpu:ppc64: fix GEMM reorder build issue on Power system Sep 25, 2025
@Tiwari-Avanish
Copy link
Contributor Author

Thank you, @vpirogov, for reviewing the changes!

@spalicki, could you please review this PR when you have a moment? Let me know if any changes are needed, and I’ll address them promptly.

@dzarukin
Copy link
Contributor

Could you, please, elaborate this bullet?

Ensure compatibility with all supported Power architectures

I don't see changes to the build system. How this compatibility was achieved?

@spalicki
Copy link
Contributor

@Tiwari-Avanish Our conda build succeeds now.

For future reference to test it you can run:

# Download conda feedstock
git clone https://github.com/conda-forge/onednn-feedstock.git && cd onednn-feedstock
# Check SHA of your changes, e.g. for this change:
wget https://github.com/Tiwari-Avanish/oneDNN/archive/refs/heads/ppc64-gemm-reorder-fix.zip
sha256sum ppc64-gemm-reorder-fix.zip
# Add your change to the recipe
vim recipe/recipe.yaml
-  url: https://github.com/oneapi-src/${{ name }}/archive/v${{ version }}.tar.gz
-  sha256: fa44702f5979ed5ab927f7ccc1d2947adb4e6d0e58c433149465c5fc71e3bd45
+  url: https://github.com/Tiwari-Avanish/oneDNN/archive/refs/heads/ppc64-gemm-reorder-fix.zip
+  sha256: 5235d10691833f44041aac9a2f0afaf82aa2134c3a8dde63ebe88e0f47dcbdb4
# Build the package
./build-locally.py linux_ppc64le_

Currently the build shows 2 warnings that need to be fixed:

 │ │ $SRC_DIR/src/cpu/ppc64/ppc64_gemm_reorder.cpp:107:9: warning: unused variable 'i' [-Wunused-variable]
 │ │   107 |     int i = 0;

which is quite obvious, and

 │ │ $SRC_DIR/src/cpu/ppc64/gemm/gemm_driver.cpp:17:5: warning: "__MMA__" is not defined, evaluates to 0 [-Wundef]
 │ │    17 | #if __MMA__

which I assume is just a typo since in all other __MMA__ guards you correctly use #ifdef.

Please also check if the performance change is in line with your expectations on Power10 and there is no degradation in case of Power8/9. You do not need to run the entire perf test suite, since you tested that last time, just a smaller subset.

@Tiwari-Avanish
Copy link
Contributor Author

Tiwari-Avanish commented Sep 26, 2025

Hi @dzarukin

Thanks for reviewing this PR.

By “compatibility with all supported Power architectures,” I meant source-level compatibility:

  • Power10 (MMA + VSX): MMA paths are now guarded with #ifdef MMA.

  • Power7/8/9 (VSX-only): Reorder implementation rewritten with VSX intrinsics. Earlier code used Power10-only intrinsics unconditionally, causing build failures; this is now fixed.

  • Build system: No changes. Fallbacks (ref_gemm_s8x8s32() for GEMM, fp32_u8 reorder) are already handled in oneDNN.

So “compatibility” means correct compilation and execution across Power7–Power10 using conditional compilation, not build-system modifications.

Thanks @spalicki for reviewing this and checking the conda build system.

  • I’ll fix the unused variable warning and the MMA guard typo.

  • For performance validation: could you confirm if I should run the following perf tests only on Power10 (where MMA changes apply), or also on Power8/9?

    • perf_matmul_inference_batched
    • perf_matmul_inference_lb
    • perf_matmul_training

Previously, I ran these tests during performance validation, so I can rerun them if needed. My assumption is that Power8/9 performance should remain unchanged, but I’m happy to test if you’d like confirmation.

@Tiwari-Avanish Tiwari-Avanish force-pushed the ppc64-gemm-reorder-fix branch 2 times, most recently from 2aea3d4 to 7c8e525 Compare September 30, 2025 05:35
@Tiwari-Avanish
Copy link
Contributor Author

Tiwari-Avanish commented Sep 30, 2025

Hi @spalicki,

I’ve fixed the warning and corrected the typo.
I also checked for any performance degradation with these changes and did not observe any.

Top 15 Benchmark improvements for perf_matmul_inference_lb (earlier vs current changes):

Click to expand
Workload Reverted Code(With MMA) GFLOPS Reverted code(With MMA) + Power8 and 9 support GFLOPS Improvement
Transformer_lt:Decoder_vocabulary*40 1.6099 1.7071 +6.04%
Transformer_lt:Encoder_MM_1*18 1.6180 1.7147 +5.98%
Transformer_lt:Decoder_MM_1*720 1.6202 1.7151 +5.86%
Transformer_lt:Encoder_MM_2*12 1.6195 1.7137 +5.82%
Transformer_lt:Encoder_MM_6*6 1.6202 1.7137 +5.77%
Transformer_lt:Encoder_MM_7*6 1.6089 1.6999 +5.65%
Transformer_lt:Decoder_MM_6*480 1.6161 1.7069 +5.62%
Transformer_lt:Decoder_MM_2*240 1.6212 1.7117 +5.58%
Transformer_lt:Decoder_MM_8*240 1.6234 1.7110 +5.40%
Transformer_lt:Encoder_MM_8*6 1.6228 1.7099 +5.37%
Transformer_lt:Decoder_MM_7*240 1.6056 1.6898 +5.24%
Transformer_lt:Decoder_MM_xx20*240 0.9286 0.9545 +2.79%
DLRM:3*1 1251.86 1286.54 +2.77%
Transformer_lt:Encoder_MM_5*6 129.267 132.162 +2.24%
DLRM:4*1 2087.73 2134.24 +2.23%

Whenever you have a moment, please review this PR. If everything looks good, kindly help merge it.

@spalicki
Copy link
Contributor

spalicki commented Oct 1, 2025

My assumption is that Power8/9 performance should remain unchanged, but I’m happy to test if you’d like confirmation.

Yes, that is why I asked. Power 8/9 should remain unchanged with Power10 getting an uplift.

@vpirogov vpirogov force-pushed the ppc64-gemm-reorder-fix branch from 7c8e525 to aca5334 Compare October 3, 2025 19:48
@vpirogov vpirogov merged commit c8cd40f into uxlfoundation:main Oct 3, 2025
29 checks passed
@vpirogov
Copy link
Contributor

vpirogov commented Oct 3, 2025

Cherry-picked to rls-v3.10 as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants