-
Notifications
You must be signed in to change notification settings - Fork 165
[tensile] Rely only on kernel ISA with no enumerate #2094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tensile] Rely only on kernel ISA with no enumerate #2094
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## develop #2094 +/- ##
============================================
- Coverage 88.79% 67.11% -21.68%
============================================
Files 301 360 +59
Lines 25768 50357 +24589
Branches 0 5665 +5665
============================================
+ Hits 22879 33795 +10916
- Misses 2889 13013 +10124
- Partials 0 3549 +3549
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
TorreZuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will also need fat binary build perf analysis. gfxall can do that type of build but not for PTS. This is same pattern as #1636 so questions on perf impact if existing GPU was being benchmarked, and if any config with ISA 0,0,0 are not advising kernel generation correctly. Tensile team will have to advise.
| if(NOT Tensile_ENUMERATE) | ||
| set(Options ${Options} "--no-enumerate") | ||
| endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safer to opt in via rocblas build to new behaviour so community perf builds are same as before
shared/tensile/Tensile/Code.py
Outdated
| @@ -359,7 +358,7 @@ class MFMAInst (Inst): | |||
| """ | |||
| def __init__(self,kernel,aIdx,bIdx,PLRval,innerUnroll): | |||
| self.endLine = "" | |||
| self.version = globalParameters["CurrentISA"] | |||
| self.version = kernel["ISA"] | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this kernel value set for target or based on config?
| currentIsa = kernel["ISA"] | ||
| maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good if this is always the current build gfx.
| # This won't affect the ISA for code-gen only for post-build asm kernels | ||
| globalParameters["CurrentISA"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this prep asm was another question, I guess it worked with default ISA
…fanuk/rocm-libraries into bug/tensile-build-with-no-enumerate
|
Replaced by #2162 |
## Motivation Performance regressions are found when adding --no-enumerate to the TensileCreateLibrary build. This PR re-implements the kernel ISA reliance from #2094 without needing to change logic files. ## Technical Details - Rely only on the kernel's ISA during the build phase. - Add additional ISA enforcement given architecture details extracted from logic files. ## Test Plan - Local performance testing for specific sizes - Comprehensive performance testing through gemmaiperf - Standard CI testing ## Test Result - See CI results in this PR for standard pipeline checks. - Performance: tested on 6665 sizes using `rocblas-bench` on gfx950 (results below) - Performance: select sizes were evaluated on gfx942 and confirmed no performance change beyond +/-1% `Single precision NN` Stat | Result -- | -- Average (% speed up) | 0.50 Median (% speed up) | 0.01 Count Faster | 3482 Count Slower | 3161 `Single precision TN` Stat | Result -- | -- Average (% speed up) | 4.17 Median (%speed up) | -0.02 Count Faster | 3042 Count Slower | 3579 `Complex double precision TN` Stat | Result -- | -- Average (% speed up) | 0.18 Median (% speed up) | 0.04 Count Faster | 4452 Count Slower | 2207 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Performance regressions are found when adding --no-enumerate to the TensileCreateLibrary build. This PR re-implements the kernel ISA reliance from ROCm#2094 without needing to change logic files. - Rely only on the kernel's ISA during the build phase. - Add additional ISA enforcement given architecture details extracted from logic files. - Local performance testing for specific sizes - Comprehensive performance testing through gemmaiperf - Standard CI testing - See CI results in this PR for standard pipeline checks. - Performance: tested on 6665 sizes using `rocblas-bench` on gfx950 (results below) - Performance: select sizes were evaluated on gfx942 and confirmed no performance change beyond +/-1% `Single precision NN` Stat | Result -- | -- Average (% speed up) | 0.50 Median (% speed up) | 0.01 Count Faster | 3482 Count Slower | 3161 `Single precision TN` Stat | Result -- | -- Average (% speed up) | 4.17 Median (%speed up) | -0.02 Count Faster | 3042 Count Slower | 3579 `Complex double precision TN` Stat | Result -- | -- Average (% speed up) | 0.18 Median (% speed up) | 0.04 Count Faster | 4452 Count Slower | 2207 - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Motivation
Performance regressions are found when adding
--no-enumerateto the TensileCreateLibrary build.Technical Evaluation
Conceptually, it doesn't make sense that device enumeration is required at build time since all requested architectures should be configured at the build invocation. However, I have confirmed that adding
--no-enumeratedoes result in a performance regression.The reason for this performance regression is due to an inconsistent strategy for determining the ISA when building. Historically, the is caused by the dual nature of many functions in Tensile whereby they are used both for tuning (which includes building & benchmarking), and building. While tuning, device enumeration is expected since benchmarking requires code-objects to be consistent with the device available on the machine. However, this has resulted in cases where the global "CurrentISA", which is set during device enumeration, is used as the authoritative build ISA. For example, in KernelWriter.py (L215 in develop), we see:
When enumeration is enabled,
maxVmcntwill be set using the ISA of the device(s) on the machine not the ISA of the target build architecture. Under the condition that the target build architecture is the same as the machine's installed device, this is fine, and tends to be a common scenario. However, when--no-enumerateis set,globalParameters["CurrentISA"]is unconditionally(0, 0, 0), which leads to improper determination of the capabilities of the target architecture. Due to the complexity within the code-generation steps, I cannot say exactly how, but my expectation is that this is leading to alternate code-flow during code-gen, resulting in assembly commands that aren't optimized for the target build arch.Technical Details
ValueErrorKernelLanguage: Assemblysolutions in logic files have properly configuredISAvalues. Unfortunately, this means that the correct solution is to update logic files.Test Plan
Test Result
In progress...
Submission Checklist