[rocRoller] Fix mix of Direct2LDS and LoadTiled nodes for small tiles.#2584
[rocRoller] Fix mix of Direct2LDS and LoadTiled nodes for small tiles.#2584amd-chunxlin wants to merge 1 commit into
Conversation
Performance Report for gfx12ResultsDetailsComparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.12% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf16', 'type_B': 'bf16', 'type_C': 'bf16', 'type_D': 'bf16', 'type_acc': 'bf16', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf16', 'type_B': 'bf16', 'type_C': 'bf16', 'type_D': 'bf16', 'type_acc': 'bf16', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
- 0.17% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.17% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.28% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
- 2.41% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
+ 5.75% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
|
Resource Report for gfx12ResultsDetails✔️ No Resource Usage Changes ✔️ |
Generated Documentation |
Code Coverage Report for gfx942Summary
This PR adds/edits 135 newly uncovered lines. Artifacts
Commit Hashes |
Performance Report for gfx942ResultsDetailsComparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
- 0.04% | p=2.5347e-02
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.08% | p=1.7451e-03
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.04% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.08% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.40% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.27% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.94% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.38% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.91% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.31% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 3.09% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.58% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.87% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.19% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.21% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 2.30% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.51% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.33% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.86% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.79% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.47% | p=3.2348e-04
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
- 3.43% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
|
Resource Report for gfx942ResultsDetails✔️ No Resource Usage Changes ✔️ |
perfci run on commit d6291a8 |
perfci run on commit 8a5a04865a49dc29f4c4b8ee659ffcedb07b4b8b |
0cf430b to
f6b7ff7
Compare
Performance Report for gfx950ResultsDetailsComparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
- 0.33% | p=2.3817e-04
| 3. FloatsGEMM(M: 256, N: 256, K: 256, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=256, N=256, K=256, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.77% | p=2.5347e-02
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.42% | p=2.5347e-02
| 3. FloatsGEMM(M: 3840, N: 4224, K: 4224, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=3840, N=4224, K=4224, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.54% | p=2.5347e-02
| 3. FloatsGEMM(M: 3840, N: 4224, K: 4224, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=3840, N=4224, K=4224, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.13% | p=3.1393e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 256, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=256, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.22% | p=5.1556e-106
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.21% | p=6.4857e-72
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.54% | p=3.6092e-32
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.18% | p=1.8265e-36
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.93% | p=1.3897e-108
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.06% | p=5.8519e-38
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.71% | p=1.9779e-48
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.41% | p=3.2100e-25
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.04% | p=6.1671e-28
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.49% | p=2.4527e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
- 1.10% | p=1.5045e-134
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 2.39% | p=4.8956e-82
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.63% | p=1.6039e-10
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
- 1.03% | p=4.8242e-20
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [2048, 8], padLDS_B: [2048, 8], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[2048, 8], padLDS_B=[2048, 8], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
- 1.38% | p=1.8920e-16
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 64653b7efe2) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.30% | p=1.1531e-26 Results truncated, see full report in workspace |
Resource Report for gfx950ResultsDetails✔️ No Resource Usage Changes ✔️ |
perfci run on commit 6e4704e |
6e4704e to
64653b7
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (77.72%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #2584 +/- ##
========================================
Coverage 67.31% 67.31%
========================================
Files 1885 1885
Lines 292473 292473
Branches 41262 41262
========================================
Hits 196877 196877
Misses 78695 78695
Partials 16901 16901
*This pull request uses carry forward flags. Click here to find out more. 🚀 New features to boost your workflow:
|
|
This pull request has been inactive for 25 days and will be marked as stale. If you would like to keep this PR open, please:
This PR will be automatically closed in 5 days if no further activity occurs. |
|
This pull request has been automatically closed due to inactivity (30 days with no updates). If you'd like to continue working on this, feel free to reopen the PR or create a new one. |
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist