[tensile] Rely only on kernel ISA with no enumerate #2094

bstefanuk · 2025-10-14T18:41:04Z

Motivation

Performance regressions are found when adding --no-enumerate to the TensileCreateLibrary build.

Technical Evaluation

Conceptually, it doesn't make sense that device enumeration is required at build time since all requested architectures should be configured at the build invocation. However, I have confirmed that adding --no-enumerate does result in a performance regression.

This performance regression stems from an inconsistent strategy for determining the ISA when building. Historically, this is caused by the dual nature of many functions in Tensile, which are used both for tuning (which includes building & benchmarking) and for building alone. While tuning, device enumeration is expected, since benchmarking requires code objects to be consistent with the device available on the machine. However, this has resulted in cases where the global "CurrentISA", which is set during device enumeration, is used as the authoritative build ISA. For example, in KernelWriter.py (L215 in develop), we see:

    currentIsa = globalParameters["CurrentISA"]
    maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"]

When enumeration is enabled, maxVmcnt is set using the ISA of the device(s) on the machine, not the ISA of the target build architecture. When the target build architecture matches the machine's installed device, which tends to be the common scenario, this is fine. However, when --no-enumerate is set, globalParameters["CurrentISA"] is unconditionally (0, 0, 0), which leads to improper determination of the capabilities of the target architecture. Due to the complexity of the code-generation steps, I cannot say exactly how, but my expectation is that this leads to an alternate code path during code-gen, resulting in assembly instructions that aren't optimized for the target build arch.
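The difference between the two lookups can be sketched as follows. This is a minimal illustration rather than Tensile's actual code: the capability table and its MaxVmcnt values are placeholders, and the helper names `max_vmcnt_from_global` and `max_vmcnt_from_kernel` are hypothetical.

```python
# Placeholder capability table. The numbers are illustrative only and are
# NOT taken from Tensile's real AsmCaps data.
global_parameters = {
    # With --no-enumerate, no device is queried, so this stays at (0, 0, 0).
    "CurrentISA": (0, 0, 0),
    "AsmCaps": {
        (0, 0, 0): {"MaxVmcnt": 0},   # placeholder entry for the "unknown" ISA
        (9, 4, 2): {"MaxVmcnt": 63},  # placeholder entry for a real target
    },
}


def max_vmcnt_from_global(params):
    """Lookup keyed on the enumerated device ISA (the pattern being removed)."""
    isa = params["CurrentISA"]
    return params["AsmCaps"][isa]["MaxVmcnt"]


def max_vmcnt_from_kernel(params, kernel):
    """Lookup keyed on the ISA recorded on the kernel itself."""
    return params["AsmCaps"][kernel["ISA"]]["MaxVmcnt"]


# Hypothetical kernel state targeting gfx942.
kernel = {"ISA": (9, 4, 2)}

print(max_vmcnt_from_global(global_parameters))          # capability of (0, 0, 0)
print(max_vmcnt_from_kernel(global_parameters, kernel))  # capability of the target
```

In this sketch the two lookups disagree whenever the global ISA differs from the kernel's target, which is the situation described above for --no-enumerate builds.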

Technical Details

1. Removes the use of "CurrentISA" in the build code and instead relies on the kernel's ISA directly.
2. If the kernel doesn't have an ISA defined on its state, the build system will raise a ValueError.
3. Points 1 and 2 require that all KernelLanguage: Assembly solutions in logic files have properly configured ISA values. Unfortunately, this means that the correct solution is to update logic files.
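The enforcement described above might look roughly like the sketch below. It assumes the kernel state behaves like a dictionary with "KernelLanguage" and "ISA" keys; the function name `require_kernel_isa` is hypothetical, and the actual check in the PR may be structured differently.

```python
def require_kernel_isa(kernel):
    """Return the ISA an assembly kernel should be built for.

    Hypothetical helper: raises ValueError when an assembly kernel has no
    ISA on its state or carries the (0, 0, 0) placeholder. Non-assembly
    kernels are not checked and their "ISA" value is returned unchanged.
    """
    isa = kernel.get("ISA")
    if kernel.get("KernelLanguage") != "Assembly":
        return isa
    if isa is None or tuple(isa) == (0, 0, 0):
        raise ValueError(
            f"Assembly kernel has no usable ISA (got {isa!r}); "
            "the logic file needs a properly configured ISA value."
        )
    return tuple(isa)


# Usage: a well-formed assembly kernel passes, a placeholder ISA fails.
print(require_kernel_isa({"KernelLanguage": "Assembly", "ISA": [9, 4, 2]}))
try:
    require_kernel_isa({"KernelLanguage": "Assembly", "ISA": [0, 0, 0]})
except ValueError as err:
    print("rejected:", err)
```

Failing loudly here trades convenience for correctness: a missing ISA surfaces at build time instead of silently selecting capabilities for the wrong architecture.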

Test Plan

Local performance testing for specific sizes
Comprehensive performance testing through gemmaiperf

Test Result

In progress...

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

codecov-commenter · 2025-10-14T19:40:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

❗ There is a different number of reports uploaded between BASE (5e012e7) and HEAD (febbf19).

HEAD has 1 upload less than BASE

Flag	BASE (5e012e7)	HEAD (febbf19)
hipSPARSE	1	0

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #2094       +/-   ##
============================================
- Coverage    88.79%   67.11%   -21.68%     
============================================
  Files          301      360       +59     
  Lines        25768    50357    +24589     
  Branches         0     5665     +5665     
============================================
+ Hits         22879    33795    +10916     
- Misses        2889    13013    +10124     
- Partials         0     3549     +3549

Flag	Coverage Δ
hipSPARSE	`?`
rocBLAS	`67.11% <ø> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.
see 661 files with indirect coverage changes

TorreZuk

This will also need perf analysis of a fat binary build. gfxall can do that type of build, but not for PTS. This is the same pattern as #1636, so there are questions about the perf impact if the existing GPU was being benchmarked, and about whether any configs with ISA 0,0,0 are not advising kernel generation correctly. The Tensile team will have to advise.

TorreZuk · 2025-10-14T21:57:05Z

shared/tensile/Tensile/cmake/TensileConfig.cmake

+  if(NOT Tensile_ENUMERATE)
+    set(Options ${Options} "--no-enumerate")
+  endif()


It would be safer to opt in to the new behaviour via the rocblas build, so community perf builds stay the same as before.

TorreZuk · 2025-10-14T21:58:46Z

shared/tensile/Tensile/Code.py

@@ -359,7 +358,7 @@ class  MFMAInst (Inst):
  """
  def  __init__(self,kernel,aIdx,bIdx,PLRval,innerUnroll):
       self.endLine = ""
-       self.version = globalParameters["CurrentISA"]
+       self.version = kernel["ISA"]


Is this kernel value set for the target, or based on the config?

TorreZuk · 2025-10-14T22:58:11Z

shared/tensile/Tensile/KernelWriter.py

+    currentIsa = kernel["ISA"]
    maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"]


Looks good if this is always the current build gfx.

TorreZuk · 2025-10-14T23:00:37Z

shared/tensile/Tensile/TensileCreateLibrary.py

+        # This won't affect the ISA for code-gen only for post-build asm kernels
+        globalParameters["CurrentISA"], 


So this prep asm was another question, I guess it worked with default ISA

bstefanuk · 2025-10-22T18:59:09Z

Replaced by #2162

## Motivation

Performance regressions are found when adding --no-enumerate to the TensileCreateLibrary build. This PR re-implements the kernel ISA reliance from #2094 without needing to change logic files.

## Technical Details

- Rely only on the kernel's ISA during the build phase.
- Add additional ISA enforcement given architecture details extracted from logic files.

## Test Plan

- Local performance testing for specific sizes
- Comprehensive performance testing through gemmaiperf
- Standard CI testing

## Test Result

- See CI results in this PR for standard pipeline checks.
- Performance: tested on 6665 sizes using `rocblas-bench` on gfx950 (results below)
- Performance: select sizes were evaluated on gfx942 and confirmed no performance change beyond +/-1%

`Single precision NN`

Stat | Result
-- | --
Average (% speed up) | 0.50
Median (% speed up) | 0.01
Count Faster | 3482
Count Slower | 3161

`Single precision TN`

Stat | Result
-- | --
Average (% speed up) | 4.17
Median (% speed up) | -0.02
Count Faster | 3042
Count Slower | 3579

`Complex double precision TN`

Stat | Result
-- | --
Average (% speed up) | 0.18
Median (% speed up) | 0.04
Count Faster | 4452
Count Slower | 2207

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
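For readers who want to reproduce summary statistics of this shape from their own benchmark runs, the sketch below shows one plausible way to compute them. It is an assumption about the method, not the script used for the numbers above: the function name `speedup_summary` is hypothetical, throughput is assumed to be "higher is better", and sizes with exactly zero change are counted as neither faster nor slower.

```python
from statistics import mean, median


def speedup_summary(baseline, candidate):
    """Summarize per-size percent speed up of candidate over baseline.

    Both inputs are sequences of throughput values (higher is better),
    one entry per problem size, in the same order.
    """
    speedups = [100.0 * (c - b) / b for b, c in zip(baseline, candidate)]
    return {
        "average": mean(speedups),
        "median": median(speedups),
        "faster": sum(s > 0 for s in speedups),
        "slower": sum(s < 0 for s in speedups),
    }


# Toy data, not real measurements.
print(speedup_summary([100.0, 100.0, 100.0, 100.0], [110.0, 90.0, 100.0, 140.0]))
```

Reporting the median alongside the average helps here: a small number of sizes with large changes can move the average noticeably while the median stays near zero.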

draft: rely only on kernel ISA with no enumerate

5e379e9

bstefanuk requested review from a team as code owners October 14, 2025 18:41

github-actions bot added shared: tensile project: rocblas labels Oct 14, 2025

assistant-librarian bot added the organization: ROCm label Oct 14, 2025

bstefanuk added the runPerformance label Oct 14, 2025

Merge branch 'develop' into bug/tensile-build-with-no-enumerate

febbf19

TorreZuk reviewed Oct 14, 2025

bstefanuk added 15 commits October 15, 2025 22:45

temp: fail if 0,0,0 ISA is found or no ISA is found

981d68d

fix: vega10 logic files ISA

50461b0

fix: vega20 logic files ISA

c9650d6

fix: arcturus logic files ISA

f706a2d

fix: aldebaran logic files ISA

3cd3a1d

fix: aldebaran_104cu logic files ISA

542feac

fix: aldebaran logic files add ISA

a6d760e

fix: vega20 logic files add ISA

8cd2825

fix: vega10 logic files add ISA

f257d15

fix: arcturus logic files add ISA

26881e9

fix: gfx950 logic files add ISA

906816d

fix: aquavanjaram942 logic files add ISA

721c777

fix: hip logic files add ISA

b32d815

fix: only check ISA if assembly kernel

8305232

Merge branch 'bug/tensile-build-with-no-enumerate' of github.com:bste…

ed4caf4

…fanuk/rocm-libraries into bug/tensile-build-with-no-enumerate

bstefanuk requested a review from a team as a code owner October 16, 2025 21:07

bstefanuk marked this pull request as draft October 16, 2025 21:07

bstefanuk mentioned this pull request Oct 17, 2025

[rocblas][tensile] Use kernel ISA during build with enforcement #2162

Merged

1 task

bstefanuk closed this Oct 22, 2025
