[rocroller] Only use pullRequest property on PRs#2643
Merged
Conversation
Contributor
Code Coverage Report for gfx942Summary
Artifacts
Commit Hashes |
ThanHenderson
approved these changes
Nov 12, 2025
Contributor
Generated Documentation |
Contributor
Performance Report for gfx950-rocrollerResultsDetails@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
- 0.20% | p=2.5347e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.20% | p=1.3905e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.65% | p=2.1515e-05
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.81% | p=3.0255e-12
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 2.30% | p=6.1189e-17
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'SingleScale', 'scaleType_A': 'E8M0', 'scale_B': 'SingleScale', 'scaleType_B': 'E8M0', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 0.00999999978, scaleValue_B: 0.00999999978, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'SingleScale', 'scaleType_A': 'E8M0', 'scale_B': 'SingleScale', 'scaleType_B': 'E8M0', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=0.00999999978, scaleValue_B=0.00999999978, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.45% | p=1.8499e-37
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.25% | p=7.7895e-12
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.26% | p=6.2574e-08
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.39% | p=1.3906e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.41% | p=2.1989e-12
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.41% | p=7.9623e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.39% | p=4.3287e-13
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 4.67% | p=3.2632e-263
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadLDSScale_A: True, loadLDSScale_B: True, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadLDSScale_A=True, loadLDSScale_B=True, swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.95% | p=8.1134e-14
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadLDSScale_A: True, loadLDSScale_B: True, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadLDSScale_A=True, loadLDSScale_B=True, swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.04% | p=3.1823e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadLDSScale_A: True, loadLDSScale_B: True, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadLDSScale_A=True, loadLDSScale_B=True, swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.16% | p=1.0634e-11
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadLDSScale_A: True, loadLDSScale_B: True, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadLDSScale_A=True, loadLDSScale_B=True, swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.07% | p=1.5737e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadLDSScale_A: True, loadLDSScale_B: True, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadLDSScale_A=True, loadLDSScale_B=True, swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 2.97% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf6', 'type_B': 'bf6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf6', 'type_B': 'bf6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.43% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 4, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=64, workgroup_size_y=4, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.11% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.41% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 9.36% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.31% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 11.50% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 3.36% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.50% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.69% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.35% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.59% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 7.99% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 2.97% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.70% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.80% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 11.00% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 5.64% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.18% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 5.32% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.07% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 2.69% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 4.26% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 4.95% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 11.95% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 6.07% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.70% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 7.31% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.24% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.83% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.75% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
+ 0.80% | p=2.5347e-02
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
|
Contributor
Resource Report for gfx950-rocrollerResultsDetails✔️ No Resource Usage Changes ✔️ |
Contributor
Performance Report for gfx942ResultsDetails@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.67% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.81% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.22% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.10% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.78% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.93% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.95% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 0.79% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 1.24% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.38% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.90% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 0.95% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+ 16.87% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 1.96% | p=2.5347e-02
| 3. FloatsGEMM(M: 8448, N: 8448, K: 128, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=8448, N=8448, K=128, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
- 8.28% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
- 8.79% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
- 8.11% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
|
Contributor
Resource Report for gfx942ResultsDetails✔️ No Resource Usage Changes ✔️ |
Contributor
Performance Report for gfx12ResultsDetails@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.17% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
- 0.18% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
+ 0.28% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
- 0.25% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadLDSScale_A: False, loadLDSScale_B: False, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadLDSScale_A=False, loadLDSScale_B=False, swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
+ 1.55% | p=2.5347e-02
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
+ 2.64% | p=1.7451e-03
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
|
Contributor
Resource Report for gfx12ResultsDetails✔️ No Resource Usage Changes ✔️ |
Contributor
perfci run on commit a27c788 |
memmett
approved these changes
Nov 13, 2025
assistant-librarian Bot
pushed a commit
to ROCm/rocRoller
that referenced
this pull request
Nov 14, 2025
[rocroller] Only use pullRequest property on PRs ## Motivation Fix issue introduced by #2410 where the property `pullRequest` was accessed on both PR and non-PR paths of the groovy code executed as part of rocRoller's performance job. ## Technical Details See motivation. ## Test Plan See the performance job working after this PR is merged. ## Test Result Performance job runs and produces artifact. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com>
ammallya
pushed a commit
that referenced
this pull request
Feb 3, 2026
* Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 [ROCm/composable_kernel commit: 8e1eb0c]
ammallya
pushed a commit
that referenced
this pull request
Feb 3, 2026
* Add stride validation to prevent segfault in blockscale GEMM * run clang-format * Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * added stride length checking to more gemm examples in ckprofiler * ran clang format * added validation header and implement in core gemm operations * remove ck_tile transpose and gemm stages from CI (#2646) * update CK build instruction step 4 (#2563) Co-authored-by: Aviral Goel <aviral.goel@amd.com> * Fixes to "General 2D Reduction Kernel" (#2535) (#2656) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format * enable aiter test_mha in daily CI (#2659) * feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * [CK_TILE] FMHA BWD Optimization For GFX950 (#2628) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp * Fix num_byte calculations to use nhead_k for K & V size (#2653) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` * [CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 * fix (#2668) * Optimize fmha fwd decode & prefill for gfx950 (#2641) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> * Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670) This reverts commit 327bf40. * added batch stride checking to batched gemm ops in profiler * removed batch stride validation * removed batched stride validation again * Update include/ck/library/utility/profiler_validation_common.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * refactor function names * added gemm stride checking to more profiler gemm operations * run clang format * add stride checkign to 01 gemm example * rename from profiler to validation common, used for examples and profiler * build of ckProfiler success * update file headers --------- Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> Co-authored-by: Cameron Shinn <camerontshinn@gmail.com> Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com> Co-authored-by: Haocong WANG <haocwang@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: 60320e9]
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Fix issue introduced by #2410 where the property
pullRequestwas accessed on both PR and non-PR paths of the groovy code executed as part of rocRoller's performance job.Technical Details
See motivation.
Test Plan
See the performance job working after this PR is merged.
Test Result
Performance job runs and produces artifact.
Submission Checklist