Contributor

@bstefanuk bstefanuk commented Oct 14, 2025

Motivation

Adding --no-enumerate to the TensileCreateLibrary build causes performance regressions.

Technical Evaluation

Conceptually, device enumeration should not be required at build time, since all requested architectures are specified in the build invocation. However, I have confirmed that adding --no-enumerate does result in a performance regression.

This regression stems from an inconsistent strategy for determining the ISA when building. Historically, this is caused by the dual nature of many functions in Tensile, which are used both for tuning (which includes building and benchmarking) and for building alone. While tuning, device enumeration is expected, since benchmarking requires code objects to be consistent with the device available on the machine. However, this has resulted in cases where the global "CurrentISA", which is set during device enumeration, is used as the authoritative build ISA. For example, in KernelWriter.py (L215 in develop), we see:

    currentIsa = globalParameters["CurrentISA"]
    maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"]

When enumeration is enabled, maxVmcnt is set using the ISA of the device(s) on the machine, not the ISA of the target build architecture. If the target build architecture matches the machine's installed device, this is fine, and that tends to be the common scenario. However, when --no-enumerate is set, globalParameters["CurrentISA"] is unconditionally (0, 0, 0), so the capabilities of the target architecture are determined incorrectly. Due to the complexity of the code-generation steps, I cannot say exactly how, but my expectation is that this leads to an alternate code path during code-gen, resulting in assembly instructions that aren't optimized for the target build arch.
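To make the failure mode concrete, here is a minimal sketch of the two lookup strategies. The capability table and its values are made up for illustration (they are not Tensile's real AsmCaps entries), and the helper names are hypothetical; only the "CurrentISA", "AsmCaps", "MaxVmcnt" and kernel["ISA"] keys come from the code quoted above.

```python
# Illustrative capability table. The numbers are placeholders, NOT real AsmCaps values.
ASM_CAPS = {
    (0, 0, 0): {"MaxVmcnt": 0},   # placeholder ISA seen when no device is enumerated
    (9, 4, 2): {"MaxVmcnt": 63},  # assumed value for the target arch
}


def max_vmcnt_from_global(global_params):
    """Old strategy: trust the ISA recorded during device enumeration."""
    isa = global_params["CurrentISA"]
    return global_params["AsmCaps"][isa]["MaxVmcnt"]


def max_vmcnt_from_kernel(global_params, kernel):
    """New strategy: trust the ISA stored on the kernel being built."""
    return global_params["AsmCaps"][kernel["ISA"]]["MaxVmcnt"]


# With --no-enumerate, CurrentISA stays at (0, 0, 0) even though the
# kernel targets (9, 4, 2), so the two strategies disagree.
global_params = {"CurrentISA": (0, 0, 0), "AsmCaps": ASM_CAPS}
kernel = {"ISA": (9, 4, 2)}

stale = max_vmcnt_from_global(global_params)
correct = max_vmcnt_from_kernel(global_params, kernel)
```

In this sketch `stale` comes from the placeholder entry while `correct` comes from the target entry, which is the kind of mismatch that could steer code-gen down a different path.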

Technical Details

  1. Removes the use of "CurrentISA" in the build code and instead relies on the kernel's ISA directly.
  2. If the kernel doesn't have an ISA defined on its state, the build system will raise a ValueError.
  3. Points 1 and 2 above require that all KernelLanguage: Assembly solutions in logic files have properly configured ISA values. Unfortunately, this means that the correct solution is to update logic files.
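Points 1 and 2 could look roughly like the sketch below. The helper name `require_kernel_isa` is hypothetical and not part of Tensile, and treating (0, 0, 0) as "not defined" is an assumption on my part; the PR text only says a ValueError is raised when the kernel has no ISA on its state.

```python
def require_kernel_isa(kernel):
    """Return the kernel's ISA as a tuple, or raise if it is unusable.

    Hypothetical helper: a missing "ISA" key raises ValueError, as described
    in point 2. Rejecting the (0, 0, 0) placeholder is an extra assumption.
    """
    isa = kernel.get("ISA")
    if isa is None:
        raise ValueError("kernel has no ISA defined on its state")
    isa = tuple(isa)
    if isa == (0, 0, 0):
        raise ValueError("kernel ISA is the (0, 0, 0) placeholder")
    return isa


# A kernel loaded from a logic file with a proper ISA passes through unchanged.
target_isa = require_kernel_isa({"ISA": [9, 4, 2]})
```

Failing loudly here is the reason point 3 follows: any assembly solution whose logic file lacks a proper ISA would stop the build rather than silently fall back to a default.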

Test Plan

  • Local performance testing for specific sizes
  • Comprehensive performance testing through gemmaiperf

Test Result

In progress...

Submission Checklist


codecov-commenter commented Oct 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

❗ There is a different number of reports uploaded between BASE (5e012e7) and HEAD (febbf19).

HEAD has 1 upload less than BASE

Flag | BASE (5e012e7) | HEAD (febbf19)
-- | -- | --
hipSPARSE | 1 | 0
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #2094       +/-   ##
============================================
- Coverage    88.79%   67.11%   -21.68%     
============================================
  Files          301      360       +59     
  Lines        25768    50357    +24589     
  Branches         0     5665     +5665     
============================================
+ Hits         22879    33795    +10916     
- Misses        2889    13013    +10124     
- Partials         0     3549     +3549     
Flag | Coverage Δ
-- | --
hipSPARSE | ?
rocBLAS | 67.11% <ø> (?)

Flags with carried forward coverage won't be shown.
see 661 files with indirect coverage changes

Contributor

@TorreZuk TorreZuk left a comment

Will also need fat binary build perf analysis. gfxall can do that type of build, but not for PTS. This is the same pattern as #1636, so there are questions about the perf impact if the existing GPU was being benchmarked, and whether any configs with ISA 0,0,0 are not advising kernel generation correctly. Tensile team will have to advise.

Comment on lines +218 to +220
    if(NOT Tensile_ENUMERATE)
        set(Options ${Options} "--no-enumerate")
    endif()
Contributor

Safer to opt in to the new behaviour via the rocblas build, so community perf builds stay the same as before.

@@ -359,7 +358,7 @@ class MFMAInst (Inst):
     """
     def __init__(self,kernel,aIdx,bIdx,PLRval,innerUnroll):
         self.endLine = ""
-        self.version = globalParameters["CurrentISA"]
+        self.version = kernel["ISA"]
Contributor

Is this kernel value set for the target or based on config?

Comment on lines 215 to 216
    currentIsa = kernel["ISA"]
    maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"]
Contributor

Looks good if this is always the current build gfx.

Comment on lines +491 to +492
    # This won't affect the ISA for code-gen only for post-build asm kernels
    globalParameters["CurrentISA"],
Contributor

So this prep asm was another question, I guess it worked with default ISA

@bstefanuk bstefanuk requested a review from a team as a code owner October 16, 2025 21:07
@bstefanuk bstefanuk marked this pull request as draft October 16, 2025 21:07
@bstefanuk
Contributor Author

Replaced by #2162

@bstefanuk bstefanuk closed this Oct 22, 2025
bstefanuk added a commit that referenced this pull request Nov 6, 2025
## Motivation

Performance regressions are found when adding --no-enumerate to the
TensileCreateLibrary build. This PR re-implements the kernel ISA
reliance from #2094 without needing to change logic files.

## Technical Details

- Rely only on the kernel's ISA during the build phase.
- Add additional ISA enforcement given architecture details extracted
from logic files.
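
One possible reading of "ISA enforcement" is sketched below. This is not the code from this PR: `gfx_name_to_isa` and `enforce_isa` are hypothetical names, and the sketch assumes the logic file yields a gfx-style architecture name whose last two characters are the hexadecimal minor and stepping values.

```python
def gfx_name_to_isa(name):
    """Convert a gfx target name such as "gfx90a" into an ISA tuple.

    Assumes the layout gfx<major><minor><stepping>, where minor and
    stepping are single hexadecimal digits.
    """
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unrecognized architecture name: {name!r}")
    digits = name[3:]
    return (int(digits[:-2]), int(digits[-2], 16), int(digits[-1], 16))


def enforce_isa(solution, arch_name):
    """Return a copy of the solution whose ISA matches the logic file's arch.

    A missing or (0, 0, 0) ISA is filled in; a conflicting ISA is an error.
    """
    expected = gfx_name_to_isa(arch_name)
    current = tuple(solution.get("ISA", (0, 0, 0)))
    if current not in ((0, 0, 0), expected):
        raise ValueError(f"solution ISA {current} conflicts with {arch_name}")
    return {**solution, "ISA": expected}


# A solution with a placeholder ISA picks up the architecture's ISA.
fixed = enforce_isa({"KernelLanguage": "Assembly", "ISA": (0, 0, 0)}, "gfx942")
```

Filling in the ISA at load time is what would make it unnecessary to edit the logic files themselves.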

## Test Plan

- Local performance testing for specific sizes
- Comprehensive performance testing through gemmaiperf
- Standard CI testing

## Test Result

- See CI results in this PR for standard pipeline checks.
- Performance: tested on 6665 sizes using `rocblas-bench` on gfx950
(results below)
- Performance: select sizes were evaluated on gfx942 and confirmed no
performance change beyond +/-1%

`Single precision NN`

Stat | Result
-- | --
Average (% speed up) | 0.50
Median (% speed up) | 0.01
Count Faster | 3482
Count Slower | 3161

`Single precision TN`

Stat | Result
-- | --
Average (% speed up) | 4.17
Median (% speed up) | -0.02
Count Faster | 3042
Count Slower | 3579

`Complex double precision TN`

Stat | Result
-- | --
Average (% speed up) | 0.18
Median (% speed up) | 0.04
Count Faster | 4452
Count Slower | 2207
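
For readers who want to reproduce this kind of summary, here is a minimal sketch of how the four statistics could be computed from per-size timings. The definition of "% speed up" used here (baseline time over candidate time, minus one) and the sample timings are assumptions for illustration, not the data or script behind the tables above.

```python
from statistics import mean, median


def summarize_speedups(baseline_us, candidate_us):
    """Summarize per-size speedups, in percent, of candidate over baseline."""
    speedups = [(b / c - 1.0) * 100.0 for b, c in zip(baseline_us, candidate_us)]
    return {
        "average": mean(speedups),
        "median": median(speedups),
        "faster": sum(s > 0 for s in speedups),
        "slower": sum(s < 0 for s in speedups),
    }


# Made-up timings in microseconds: one size improves a lot, one is
# unchanged, and two get slightly slower.
summary = summarize_speedups([100, 100, 100, 100], [50, 100, 101, 102])
```

With these sample numbers the average is positive while the median is slightly negative, which shows how a few large wins can lift the average even when more sizes are slower than faster. Sizes with exactly zero change count as neither faster nor slower in this sketch.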

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
bstefanuk added a commit to bstefanuk/rocm-libraries that referenced this pull request Nov 7, 2025