[rocRoller][hipblaslt] Use separate array for flags in rocRoller StreamK #3177
Merged
added 9 commits
December 1, 2025 23:02
Performance Report for gfx12
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
- 0.10% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, 
workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
- 0.09% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, 
wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
- 0.20% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, 
wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
- 0.37% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, 
workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
- 2.06% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
+ 1.74% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
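For readers who want to post-process a report like the one above, here is a minimal sketch of pulling the summary lines (for example `- 0.10% | p=2.5347e-02`) out of the text and keeping those below the 0.05 threshold named in the report header. The function names `parse_diff` and `significant` and the regular expression are hypothetical and not part of the CI tooling. The sketch also makes no assumption about whether a leading `-` means faster or slower; it only preserves the sign.

```python
import re

# Hypothetical pattern for summary lines such as "- 0.10% | p=2.5347e-02".
LINE_RE = re.compile(r"^([+-])\s*([0-9.]+)%\s*\|\s*p=([0-9.eE+-]+)$")


def parse_diff(line):
    """Return (signed_percent, p_value), or None if the line is not a summary line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    sign = -1.0 if m.group(1) == "-" else 1.0
    return sign * float(m.group(2)), float(m.group(3))


def significant(lines, alpha=0.05):
    """Keep only parsed entries whose p-value is below alpha."""
    parsed = (parse_diff(line) for line in lines)
    return [p for p in parsed if p is not None and p[1] < alpha]


# Sample lines copied from the gfx12 report, plus one non-summary line.
sample = [
    "- 0.10% | p=2.5347e-02",
    "| CodeGen(instCount: 40000, instructions: comments)",
    "- 2.06% | p=5.6994e-05",
    "+ 1.74% | p=5.6994e-05",
]
results = significant(sample)
```

Lines that do not match the summary pattern, such as the long configuration rows, are skipped rather than raising an error.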
Resource Report for gfx12
✔️ No Resource Usage Changes ✔️
Code Coverage Report for gfx942
This PR adds/edits 2 newly uncovered lines.
Performance Report for gfx942
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.04% | p=2.5347e-02
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.41% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 2.45% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 2.08% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 2.55% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 2.37% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.01% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 2.36% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.86% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 3.12% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.99% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.50% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.89% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.18% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 2.23% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.79% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.70% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 4.94% | p=5.6994e-05
| 3. FloatsGEMM(M: 8448, N: 8448, K: 128, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=8448, N=8448, K=128, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 7.78% | p=5.6994e-05
| 3. FloatsGEMM(M: 8448, N: 8448, K: 128, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=8448, N=8448, K=128, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.86% | p=2.2479e-02
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
- 0.68% | p=2.5347e-02
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
+ 1.39% | p=1.7451e-03
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
|
Resource Report for gfx942
@@ Resource Usage Changes @@
====================================================================================================
- SGPR: 68 -> 70 (+2)
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 74 (+2)
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 74 (+2)
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 84 -> 86 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 80 -> 82 (+2)
Results truncated, see full report in workspace
added 2 commits
December 6, 2025 05:56
Contributor
There was a test that Nathan enabled which uncovered a problem with multiple fixups -- I see you have a fix for that in this PR. Can you enable the test that Nathan found?
added 2 commits
December 10, 2025 17:32
…braries into users/liu-yiqian/reset-flag
memmett
approved these changes
Dec 10, 2025
Performance Report for gfx950-rocroller
Results
Details
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.10% | p=6.2764e-06
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.40% | p=1.0272e-07
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.06% | p=8.4176e-102
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.36% | p=1.2620e-51
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.04% | p=3.9597e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.56% | p=4.8146e-92
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, 
wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.41% | p=1.3500e-08
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, 
wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.22% | p=8.1060e-24
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, 
wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.15% | p=7.7441e-06
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.52% | p=1.1861e-09
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.18% | p=1.8651e-26
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.23% | p=2.2561e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.29% | p=1.4393e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, 
wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.34% | p=6.2764e-06
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, 
wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.56% | p=1.4473e-14
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, 
wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.20% | p=4.0937e-18
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=128, wave_m=16, 
wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.28% | p=1.8450e-37
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, 
wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.08% | p=1.2002e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, 
wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.27% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf6', 'type_B': 'bf6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf6', 'type_B': 'bf6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.40% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf6', 'type_B': 'bf6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf6', 'type_B': 'bf6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.57% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.88% | p=2.5347e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.42% | p=2.5347e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp6', 'type_B': 'fp6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp6', 'type_B': 'fp6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, 
wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.16% | p=2.5347e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp6', 'type_B': 'fp6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp6', 'type_B': 'fp6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=128, wave_m=16, wave_n=16, wave_k=128, 
wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.36% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.24% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.51% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 7.19% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.40% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.95% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.28% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.93% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.30% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.77% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.78% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.59% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.24% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.60% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.56% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.16% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 2.22% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.48% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.70% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.18% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.96% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.63% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.21% | p=1.7451e-03
| 3. FloatsGEMM(M: 8448, N: 8448, K: 128, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=8448, N=8448, K=128, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.75% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
- 1.37% | p=2.5347e-02
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
Resource Report for gfx950-rocroller
@@ Resource Usage Changes @@
====================================================================================================
- SGPR: 60 -> 63 (+3)
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 63 -> 64 (+1)
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 63 -> 64 (+1)
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2) | VGPR: 154 -> 152 (-2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 75 -> 76 (+1)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 75 -> 76 (+1)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2) | VGPR: 182 -> 180 (-2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 75 -> 76 (+1)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2) | VGPR: 154 -> 152 (-2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 75 -> 76 (+1)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 72 -> 75 (+3)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, 
wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2) | VGPR: 154 -> 152 (-2)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 75 -> 76 (+1)
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, 
workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- SGPR: 73 -> 75 (+2)
Results truncated, see full report in workspace |
bnemanich
reviewed
Dec 11, 2025
perfci run on commit 96262ce |
perfci run on commit 7aa9059 |
bnemanich
approved these changes
Dec 12, 2025
perfci run on commit 6cb5d65 |
Motivation
StreamK needs a separate scratch space for its flags, with a guarantee that the flags are all zero both before and after the kernel launch.
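The invariant can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the rocRoller implementation: the names `StreamKScratch` and `launch` are hypothetical, and the idea that the reducing workgroup clears each flag after consuming the matching partial result is an assumption about how the "all zeros after launch" guarantee might be met.

```python
from dataclasses import dataclass, field


@dataclass
class StreamKScratch:
    """Hypothetical model: partial results and flags live in separate arrays."""
    num_wgs: int
    partials: list = field(init=False)
    flags: list = field(init=False)

    def __post_init__(self):
        # Partial results may hold stale data between launches.
        self.partials = [0.0] * self.num_wgs
        # Flags are zeroed once at allocation and must stay zero between launches.
        self.flags = [0] * self.num_wgs


def launch(scratch, contributions):
    """Simulate one launch sequentially (assumed protocol, for illustration)."""
    assert len(contributions) <= scratch.num_wgs
    assert not any(scratch.flags), "flags must be all zeros before the launch"

    # Producers: write the partial result first, then raise the flag.
    for wg, value in enumerate(contributions):
        scratch.partials[wg] = value
        scratch.flags[wg] = 1

    # Consumer: check each flag, accumulate the partial, then clear the flag.
    total = 0.0
    for wg in range(len(contributions)):
        assert scratch.flags[wg] == 1
        total += scratch.partials[wg]
        scratch.flags[wg] = 0

    assert not any(scratch.flags), "flags must be all zeros after the launch"
    return total
```

In this model, a second `launch` on the same `StreamKScratch` works without re-zeroing anything on the host, because only the flags carry state that matters across launches and they are already back at zero.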
Technical Details
Test Plan
Test Result
Submission Checklist