[rocthrust] Remove .jenkins folder#159
Merged
samjwu
approved these changes
Jun 6, 2025
assistant-librarian Bot
pushed a commit
to ROCm/rocThrust
that referenced
this pull request
Jun 6, 2025
[rocthrust] Remove .jenkins folder
assistant-librarian Bot
pushed a commit
that referenced
this pull request
Jul 8, 2025
* Enable round-trip tests for single-process, multi-GPU tests
* Adding a few sizes for single-process, multi-gpu tests so that hipfftXtExecDescriptorC2R gets used, too.
ammallya
pushed a commit
that referenced
this pull request
Jul 31, 2025
* Enable round-trip tests for single-process, multi-GPU tests
* Adding a few sizes for single-process, multi-gpu tests so that hipfftXtExecDescriptorC2R gets used, too.
[ROCm/hipFFT commit: 8e341ed]
ammallya
pushed a commit
that referenced
this pull request
Sep 24, 2025
* Defaulting clang-tidy checks to off during rename process * New rules * Descriptors renamed (#135) * New rules * Descriptors renamed * Formatting fixes * rename backend/src/plugin folder (#136) * rename backend/src/plugin folder * fix comments * missed 2 changes * Missed renaming shared library as part of plugins (#139) * Handle Renamed (#134) * rename handles * ignore hipdnnHandle in clang tidy * review comments * rename miopen legacy plugin engines. (#141) * Backend Rename Leftovers (#138) * Performed the rest of the backend renames * Format * PR comments * Renaming in hipdnn_frontend/attributes (#140) Update naming for our new tidy rules * renames for miopen_legacy_plugin/root (#142) * Renaming miopen legacy plugin base dir * fix missed variable and function * Rename sdk plugin (#145) * Rename Engine_config_interface * fix tests * Convert wrappers and test utils * Remaining plugin sdk renames * formatting * Apply formatting * Fix snake_case for last_erorr * Renaming in miopen legacy tests (#143) * first batch of legacy tests * tests set 2 for miopen legacy plugin * missed integration tests, fix small issues * fix * another fix * rename * Rename sdk plugin tests (#148) * Rename sdk/include/hipdnn_sdk/test_utilties symbols and usages (#146) * Rename sdk/include/hipdnn_sdk/test_utilties symbols and usages * Format * Fix test_tensor names * merge conflicts fix * format changes * format --------- Co-authored-by: Brian Harrison <brian.harrison@amd.com> * Update sdk logging names (#147) Co-authored-by: Brian Harrison <brian.harrison@amd.com> * Rename sdk test_utilities tests (#150) * Renaming complete in sdk utilities, tests, and samples. 
(#149) * Apply formatting * Rename frontend's backend wrappers and helpers (#152) * Renamed frontend's backend wrappers and helpers * Format * Rename backend integration tests (#153) * Rename backend integration tests * formatting * Fix missing rename variable * Renaming frontend/node and base dir (#157) * miopen plugin interface renames (#158) * /backend/tests/ rename (#156) * /backend/tests/ rename * Uncomment * Change global * Rename test plugins (#160) * Rename test plugins for the new naming scheme * Formatting * Fix accidental rename * Camel case for InitialziePlugin * Samples renamed (#159) * Renamed samples * Format * Typo and wrong casing --------- Co-authored-by: Samuel Reeder <samuel.reeder@amd.com> * Frontend integration tests renamed (#161) * Rename sdk tests utilities (#162) * Rename sdk test utilities to match new scheme * Swap to s_ instead of g_ for local static * camelCase * Replace global with test fixture (#163) * Replace global with test fixture * Change name * Rename test internals (#164) * Fix SDK tidy breaks * Backend fixes * More backend and updated clang-tidy for constexpr * Fix tidy errors * More tidy fixes for plugin, sdk, and backend * NOLINT setters and more of frontend * Fix backend tidy error * More frontend fixes * Lint free (by my reckoning) * I did this by hand * Add code style and naming doc (#165) * style doc v1 * add table of contents * fix first batch of comments * review concerns, and new section 14 * add links to style doc from other places * update both cline and copilot instructions to match (#166) * Tidy fixes (#167) * fix a few tidy issues * change tidy to * * Re-enable enum tidies * default tidy back on --------- Co-authored-by: Jeremy Hart <jeremy.hart@amd.com> * fix formatting (#168) --------- Co-authored-by: mousdahl-amd <mitch.ousdahl@amd.com> Co-authored-by: Adam Dickin <adam.dickin@amd.com> Co-authored-by: bibek <108366729+bghimireamd@users.noreply.github.com> Co-authored-by: Jeremy Hart <jeremy.hart@amd.com> 
Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com> Co-authored-by: Brian Harrison <brian.harrison@amd.com>
ammallya
pushed a commit
that referenced
this pull request
Sep 24, 2025
* Defaulting clang-tidy checks to off during rename process * New rules * Descriptors renamed (#135) * New rules * Descriptors renamed * Formatting fixes * rename backend/src/plugin folder (#136) * rename backend/src/plugin folder * fix comments * missed 2 changes * Missed renaming shared library as part of plugins (#139) * Handle Renamed (#134) * rename handles * ignore hipdnnHandle in clang tidy * review comments * rename miopen legacy plugin engines. (#141) * Backend Rename Leftovers (#138) * Performed the rest of the backend renames * Format * PR comments * Renaming in hipdnn_frontend/attributes (#140) Update naming for our new tidy rules * renames for miopen_legacy_plugin/root (#142) * Renaming miopen legacy plugin base dir * fix missed variable and function * Rename sdk plugin (#145) * Rename Engine_config_interface * fix tests * Convert wrappers and test utils * Remaining plugin sdk renames * formatting * Apply formatting * Fix snake_case for last_erorr * Renaming in miopen legacy tests (#143) * first batch of legacy tests * tests set 2 for miopen legacy plugin * missed integration tests, fix small issues * fix * another fix * rename * Rename sdk plugin tests (#148) * Rename sdk/include/hipdnn_sdk/test_utilties symbols and usages (#146) * Rename sdk/include/hipdnn_sdk/test_utilties symbols and usages * Format * Fix test_tensor names * merge conflicts fix * format changes * format --------- Co-authored-by: Brian Harrison <brian.harrison@amd.com> * Update sdk logging names (#147) Co-authored-by: Brian Harrison <brian.harrison@amd.com> * Rename sdk test_utilities tests (#150) * Renaming complete in sdk utilities, tests, and samples. 
(#149) * Apply formatting * Rename frontend's backend wrappers and helpers (#152) * Renamed frontend's backend wrappers and helpers * Format * Rename backend integration tests (#153) * Rename backend integration tests * formatting * Fix missing rename variable * Renaming frontend/node and base dir (#157) * miopen plugin interface renames (#158) * /backend/tests/ rename (#156) * /backend/tests/ rename * Uncomment * Change global * Rename test plugins (#160) * Rename test plugins for the new naming scheme * Formatting * Fix accidental rename * Camel case for InitialziePlugin * Samples renamed (#159) * Renamed samples * Format * Typo and wrong casing --------- Co-authored-by: Samuel Reeder <samuel.reeder@amd.com> * Frontend integration tests renamed (#161) * Rename sdk tests utilities (#162) * Rename sdk test utilities to match new scheme * Swap to s_ instead of g_ for local static * camelCase * Replace global with test fixture (#163) * Replace global with test fixture * Change name * Rename test internals (#164) * Fix SDK tidy breaks * Backend fixes * More backend and updated clang-tidy for constexpr * Fix tidy errors * More tidy fixes for plugin, sdk, and backend * NOLINT setters and more of frontend * Fix backend tidy error * More frontend fixes * Lint free (by my reckoning) * I did this by hand * Add code style and naming doc (#165) * style doc v1 * add table of contents * fix first batch of comments * review concerns, and new section 14 * add links to style doc from other places * update both cline and copilot instructions to match (#166) * Tidy fixes (#167) * fix a few tidy issues * change tidy to * * Re-enable enum tidies * default tidy back on --------- Co-authored-by: Jeremy Hart <jeremy.hart@amd.com> * fix formatting (#168) --------- Co-authored-by: mousdahl-amd <mitch.ousdahl@amd.com> Co-authored-by: Adam Dickin <adam.dickin@amd.com> Co-authored-by: bibek <108366729+bghimireamd@users.noreply.github.com> Co-authored-by: Jeremy Hart <jeremy.hart@amd.com> 
Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com> Co-authored-by: Brian Harrison <brian.harrison@amd.com> [ROCm/hipDNN commit: 839cf6c]
bethune-bryant
pushed a commit
that referenced
this pull request
Apr 3, 2026
* Refactor bpe utility functions from Python to C++
* Fix Num Record Bitfield for gfx125x
jichangjichang
pushed a commit
that referenced
this pull request
Apr 8, 2026
* Refactor bpe utility functions from Python to C++
* Fix Num Record Bitfield for gfx125x
jichangjichang
pushed a commit
that referenced
this pull request
Apr 13, 2026
* Refactor bpe utility functions from Python to C++
* Fix Num Record Bitfield for gfx125x
jichangjichang
pushed a commit
that referenced
this pull request
Apr 13, 2026
* Refactor bpe utility functions from Python to C++
* Fix Num Record Bitfield for gfx125x
tomchengchitang
pushed a commit
that referenced
this pull request
Apr 16, 2026
* Enable v_cvt_f16_fp8 and add v_cvt_pk_f16_fp8 This patch also rename original option: Hascvtf16_fp8 as Hascvtf16_fp8_sf32. Because the old Hascvtf16_fp8 is actually for v_cvt_scalef32_pk_f16_fp8. * true16 modifier implementation This patch includes true16 modifier implementation and related compiler option. * Optimize FP32 to FP16 with pack data by v_cvt_pk_f16_f32 * Make v_cvt_f32_f16 and v_cvt_f16_f32 support true16 syntax This change modifies the relevant code, except in mixed mode scenarios. * Support VCvtPkFP8toF32 and VCvtPkBF8toF32 with VOP3 in loopSum * Use S_MOV_B32 to reset EXEC for 32-lane wavefronts * Update test YAML for data conversion This patch update test YAML for the following instructions: v_cvt_f32_f16, v_cvt_f16_f32, v_cvt_pk_f16_f32, v_cvt_f32_bf16, v_cvt_f32_fp8, v_cvt_f32_bf8, v_cvt_pk_f32_fp8, v_cvt_pk_f32_bf8, v_cvt_pk_fp8_f32, v_cvt_pk_bf8_f32 * Minor modifications based on Copilot review recommendations * Fix v_cvt_f32_bf16 build error and reviewer's suggesstions * [hipSPARSELt] Enable 8-bit SPMM kernels (#123) * Fix 8-bit sparse kernels by: 1. Adding i32-fp32 conversion and 2. fixing tail-loop dense b cndmask offset. Additionally update 16-bit yamls * Fix 8-bit datatype packing issue. * 1. Update sparse gfx1250 testing yamls. These are simplified to reduce tox testing time, but tested on local end. 2. More yamls will be added. * Code refine, fix yamls and add metadatalayout test cases * Add and revise test yamls. * Fix i8 conversion issue when using GSU > 1. Add a missing yaml. * Follow up - Refine i8 conversion issue when using GSU > 1. Add GSU2 parameter in yamls. * remove whitespace in line 370 in GlobalWriteBatch.py * Enable edge and tail loop. (#143) Co-authored-by: George Tseng <george.tseng@amd.com> * use ShortBranchMaxLength for SCLongBranchVccnz * [hipBLASLt] extend the vgpr pool for the metatadata at OptNLL * [hipBLASLt] refactor the vgpr allocation of sparse metadata. 
* Don't use pack for metadata when LDSTr Metadata is enabled. * Refine sparse test for i8, f8 and b8. (#156) * fix buffer insts record for gfx1250 (#109) * fix buffer insts record for gfx1250 --------- Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: marhuang_amdeng <marhuang@amd.com> * 6-bit transpose load revisited (#157) * wip: basic tr6 ds load support * Support glvw == 16 for FP6 * Fixed wrong VGPR padding for 6-bit gl and lr * Implemented new thread mapping for ds tr6 load * wip: add more test cases for BF6 * Reject solution if GRVW exceeds corrresponding side of MT * Added F6 test cases for all transposes * Fixed wrong tranpose settings for f6/b6 tests * Fixed VGPR allocation for ds tr6 load * Fixed conflicts in auto-merging * enable extop for gfx11/12 (#129) Co-authored-by: boringmorning <huangchen1999@gmail.com> * Support edge/tail for FP6/BF6 (#161) * BF6/FP6 TN Tail when glvw=16/32 * Fix shiftLrElements in TailLoop * Add FP6/BF6 tail and edge testcases * Simplify the tail/edge checks for FP6/BF6 * Refactor BPE Utility Functions and Fix Num Record Bitfield (#159) * Refactor bpe utility functions from Python to C++ * Fix Num Record Bitfield for gfx125x * MX : init and host validation: fix * MX : codegen setting 1 * refactor calcLdsNumBytes * MX: solution 2 * reorder setGlobalReadVectorWidth * MX: solution 3 * MX: add UseGeneralizedNLCOne MXSA/B * MX: LDS size calculation * MX: codegen setting * move ValuMXSAB vgpr to beginning of VGPR pool p1 * MX: add kernel argument * MX : initKernel part * MX : vgpr set * move ValuMXSAB vgpr to beginning of VGPR pool p1 * MX: global read macro * MX : local read offset * MX: local write and global read offset * MX: global read increaments * MX: staggerU * MX: global load * MX: global read increament * MX: wait * MX: ds store * MX: localWriteSwapOffsets * MX: local read part * MX: MFMA * MX: lraDeclareAddresses * MX: SIA3 pre processing * MX: SIA3 * MX: SIA3 pack for MXSA/B * MX: SIA3 pack * fix: 
use _DepthUTc to replace DepthU * MX: Tail: localWriteResetOffsets * MX: Tail: vgpr alloc, remove staggerU, global read * MX: tailLoopAllocValuVgpr * MX: Tail: localRead init * MX: Tail: local read * MX: unsupported datatype exception * MX: add test yaml * F6 adjust LDS alignment * MX: F4 use F8 scale type * MX: F4: scale type E5M3 * MX: support MX with non-MX combination * MX : initKernel part * MX: Tail: remove dirty MX data * MX: Tail: release MX valu vgpr after tail * MX: Edge: MX Scale shiftptr align with noraml Buffer * Shiftptr: fix Shiftptr limitation * MX: update AssertSummationElementMultiple for MX * F4: use 32x16 step 1 * F4: use 32x16: local read: add numTilePerInst loop * F4: use 32x16: local read: fix offset * F4: use 32x16: wmma tail: add numTilePerInst loop * F4: use 32x16: wmma tail * F4: use 32x16: store D * F4: use 32x16: vectorwidth * F4: use 32x16: multi wave * F4: use 32x16: local read: use MIInputPerThUnroll * F4: use 32x16: local read: add numTilePerInst loop * F4: use 32x16: local read: dstr offset * F4: use 32x16: local read: fix wave offset * MX: F4: fix tmpVGPR allocation * MXF4: support 32x16 instruction * F4: use 32x16: multi wave: NT, TT * F4: use 32x16: wmma tail * MX: fix MX buffer load length * MX: HipBlasLt: add MXE8B32 F8 yamls * MX: HipBlasLt: add MXE8B32B B8 yamls * MX: HipBlasLt: add MXE8B16 F8 yamls * MX: HipBlasLt: add MXE8B16 B8 yamls * MX: HipBlasLt: add MXE8B16 F6 yamls * MX: HipBlasLt: add MXE8B16 B6 yamls * MX: HipBlasLt: add MXE8B32 B6 yamls * MX: HipBlasLt: add MXE8B32 F6 yamls * MX: HipBlasLt: add MXE8B32 F4 yamls * MX: HipBlasLt: add MXF8B32 F4 yamls * MX: HipBlasLt: add MXE5M3B32 F4 yamls * MX: HipBlasLt: add MXE5M3B16 F4 yamls * MX: HipBlasLt: add MXF8B16 F4 yamls * MX: HipBlasLt: add MXE8B16 F4 yamls * MX: HipBlasLt: add MXE8B32/MXF8B16 sample * MX: HipBlasLt: support MXE8B32/MXF8B16 F4 * MX: HipBlasLt: support MXF8B32/MXE8B16 F4 * MX: HipBlasLt: add MXF8B32/MXE8B16 sample * MX: HipBlasLt: support 
MXF8B32/MXE8B16 F4 * MX: HipBlasLt: add MXE5M3B32/MXE5M3B16 sample * EfficiencyMonitor: do nothing in set_device_id if not enabled * benchmark: enable F4/F6 type without block scale * MX: hipblaslt-bench: support MXE8B32/MXE8B16 F4 * MX: hipblaslt-bench: support all F4 mx scale type * MX: hipblaslt-bench: F8/F6/F4 smoke gtest * MX: hipblaslt-bench: limit hipblaslt_e8 rand_int/hpl init value * Fix hipblaslt-test failed with HHS This patch fix the following hipblaslt-test test case (HHS): smoke_matmul_bias_relu_SAV_smoke_f16_rf16_rf16_rf16_rf32_r_relu* * f8f6f6 mix mode on hipblaslt * Added TensorLoadToLds for rocisa * WIP: F8 codegen for TDM * WIP: several bug fixes for TDM * WIP: TDM wave separated initial support * WIP: TDM uses up to 12 SGPRs now * WIP: TDM optimize lds address swap * WIP: TDM supports FP4 now * WIP: temporarily integrate regular MX and TDM A&B * Fix bugs in UseSgprForGRO. * WIP: remove g2l vgpr allocations for A & B when TDM enabled * WIP: fixed waitcnt calculation when TDM enabled * WIP: TDM supports LdsAlignPow2 is False * WIP: reject non-TN TDM kernels * Fix large LDS logic (#189) * large LDS follow gfx950 logic * fix calculation of # local read/write calculation, lessen StoreSwapAddr enable constraits --------- Co-authored-by: boringmorning <huangchen1999@gmail.com> * [hipBLASLt] cherry-pick TDM MX changes (#411) * WIP: TDM for MX buffers * WIP: fixed TDM lds swap for MXSA and MXSB * WIP: reduce VGPR usage when TDM enabled * WIP: fixed TDM MT selection for MX buffer * WIP: adjusted TDM LDS swap implementation for StoreSwapAddr * WIP: several minor fixes for TDM codegen * Fixed incompatible s_waitcnt parameter * TDM: fix MXS staggerU * TDM padding (#423) * Support TDM padding * Add padding checks for TDM * Add mxf8ss_tdm test yaml * fix LdsBlockSizePerPadB reject condition * [hipBlasLt] remove mix mode yamls and tests * fix comment in hipblaslt.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * remove KWS keyword * fix 
helper.h typo * re-order header list in include CMakeLists.txt * remove some comment * fix build break * removd DepthU 128 form mxf8ss_tdm.yaml * [hipSPARSELt] Enable Depth-U=256 in 8-bit SPMM kernels (#166) * Fix SPMM DU256 issue. Add DepthU=256 test param in yamls. * Update spmm_i8 yaml. * Refine metadata iteration variables based on commit 45ec843. * Fix vgprPerInputM calculation. * Update spmm 8-bit yamls from clr0 to clr1. * [hipSPARSELt] Fix Solution.py on gfx1250. (#231) * Fix Solution.py on gfx1250. * Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py * Upload gfx1250 logic yamls to hipSPARSELt. * Update gfx1250 logic yaml. Fix inconsistent yamls, and disable UserArgs. * [Workaround] Bypass metadata layout when build hipSPARSELt. * [hipSPARSELt] Update setComputeInputType to setComputeInputTypeA/B in tensile_host.cpp * [hipSPARSELt] Fix sparse b tail loop mask issue. * [hipSPARSELt] Patch reduced-size smoke tests * [hipSPARSELt] Fix 8-bit metadata vGPRS number and idx. Add i8is support when using bias. * [hipSPARSELt][Patch] Fix bugs found on PR#633 based on PR#682 * Revert "[Temp] disable spmm test, since spmm is not ready." This reverts commit 8b1ab52. * [hipSPARSELt] Add gfx1250 I8IS (Int8->Int32) Tensile library logic for hipsparselt Add 8 GridBased library logic YAML files for I8 input / I32 output sparse matrix multiplication on gfx1250, covering all 4 transpose combinations (NN, NT, TN, TT) for both Sparse-A and Sparse-B. * [hipSPARSELt] Fix KeyError 'UnrollMajorLDSMetadata' for non-sparse kernels, and test arch typo in spmm_fp16_ml1.yaml Add sparse guard before accessing UnrollMajorLDSMetadata in KernelWriter.py and KernelWriterAssembly.py. This key is only set for sparse kernels in Solution.py, but was accessed unconditionally, causing KeyError for non-sparse configs (e.g. sgemm_xf32_asm). 
Also fix typo in spmm_fp16_ml1.yaml: skip-gfx942m -> skip-gfx942 so the gfx1250-only test is correctly skipped on gfx942 machines. * fix: use _DepthUTc to replace DepthU * handle float bpe * check mxBlock in prombleType before use it * typo in DSStoreB256 * modify auto assign LocalReadVectorWidthA/B * reject condition use LocalReadVectorWidthA/B * [hipSPARSELt][Tmp] Disable autoVectorWidthA/B in gfx1250 * Fix NameError SMovBX is not defined * Fix invalid assembly token '**' in GSU duBpe calculation mult_MI_Dim already contains a leading '*' (e.g. "*MI_M"), so the f-string should not add another '*', which produced invalid assembly like "64**MI_M" instead of "64*MI_M". * Fix undefined MI_M/MI_N assembly symbols in GSU swizzle code The GSU code in computeLoadSrd and graIncrements emitted string literals "*MI_M" / "*MI_N" as immediate operands in assembly instructions, but these are not valid assembler symbols and caused "expected relocatable expression" errors during .s -> .o assembly. Resolve MI_M/MI_N to their actual numeric values at code-generation time by multiplying with kernel["MatrixInstM"] / kernel["MatrixInstN"] in Python, so the assembler only sees plain integer immediates. * Reject SwizzleTensor with NumLoadsCoalesced > 1 (swizzleA.yaml) When SwizzleTensorA/B is enabled with NumLoadsCoalesced > 1, the tile offset stride calculation in graTileOffsets uses an inter-wave stride (numKr * WvG * swzBlockSize) that does not correctly map the second coalesced read for edge cases where MacroTile > actual matrix dimension (e.g., MT_M=256 with M=128). This causes the second tile read to access incorrect memory locations, resulting in numerical accuracy failures. All 58 FAILED tests in swizzleA.yaml had NLCA=2 + SwizzleTensorA + MT_M > M, while identical kernels with NLCA=1 passed. This bug was introduced during gfx1250 SwizzleTensor development where the NLCA>1 edge case in the swizzle global read path was not considered. 
Reject this configuration until the swizzle tile offset stride is properly fixed for multi-coalesced-load edge handling. * Fix epilogue ScaleAB load condition in fp8nfp16mix_fp8nss.yaml The epilogue ScaleAB loading code compared DataType with itself (always true) instead of with MacDataType, causing it to unconditionally load ScaleA from AddressScaleA even when preloadScaleA=True (DataTypeA > MacDataTypeA). Fix: Change the inner loop condition from DataType<=DataType (nop) to DataType<=MacDataType. * Fix DTL correctness for FP16/FP8: cast float bpe values to int for assembly operands * Fixed lds padding sanity check * Fix the logic path for VCvtF32toF16 * Fix incorrect SGPR global read offset for DirectToVgpr with UseSgprForGRO In computeScalarGroImpl(), the DirectToVgpr path used the unsuffixed kernel["LocalReadVectorWidth"] (defaults to -1) instead of the tensor- specific kernel["LocalReadVectorWidth{A,B}"]. This produced negative unrollStride values, corrupting ScalarGlobalReadOffset in the generated assembly and causing incorrect GEMM results. Test: dtv_gfx90a.yaml (DirectToVgprA=1, UseSgprForGRO=1, GRVWA=1, GRVWB=1) * convert offset from float to int before passing into DSModifiers * Fixed incorrect enum values * Removed deprecated rejection condition for swizzle kernel * [hipSPARSELt] Seperate SPMM wider localRead conditions from dense. Disable reject Sparse A kernel only support PGR with EPS=1 * Fix: skip Formocast prediction when analyticalHardware is unavailable The tensilelite-client crashed with a segfault in AllSolutionsIterator::preProblem when running bf16_tn_gfx12_predict.yaml on gfx1200. The root cause is that origami does not support gfx1200, so analyticalHardware (shared_ptr<origami::hardware_t>) is never initialized in HipAMDGPU. The getHardware() helper then dereferences a null pointer. 
* Fix NameError in SIA2: replace undefined instPerPack with instPerPackA/instPerPackB The variable instPerPack was renamed to instPerPackA and instPerPackB during a prior refactor, but 9 references in the scheduleIterAlg == 2 code path of _makeSubIterSchedule were not updated. This caused a NameError during assembly kernel generation for gfx1200 (RDNA4) targets, which are the only architectures with library configs using SIA2. * convert tailloopInNllmaxUnit from float to int * Fix DVT tail loop validation failure on gfx12 by adding missing bpe multiply The GLOBAL_OFFSET macro was refactored to move the BPE multiplication out of the macro and into each call site. However, the globalReadGuardK function's call to GLOBAL_OFFSET for computing the max valid address offset in DVT tail loops was not updated to include the external bpe multiply. * Skip F4 related test YAML for gfx1200 * Fix swizzle tensor SRD limit alignment in computeLoadSrd MX integration refactored the IndicesSummation check into a nested if/elif chain, making the swizzle alignment path unreachable. Swizzled tensor A used raw SizeL-1 instead of alignTo(SizeL, swzStride)-1, causing SRD limit to be too small and buffer_load to return zeros at K boundaries. Fixes 58 NLCA=2 test failures. * Fix missing arguments and conditions * remove added testcases in matmul_bias_vector_dst_fp16_32 * Update test to accept VEC16_UE4M3 scale mode in SetAttribute The MX F4 support commits moved VEC16_UE4M3 from the rejection list to an accepted case (mapped to ScalingFormat::Block_16_UE4M3) for MXFP4 on gfx1250, but the test was not updated accordingly. Fix the test to expect SUCCESS and verify the set/get round-trip. Test: hipblaslt-test --gtest_filter=*pre_checkin_aux_matmul_set_get_attr_f16_r* * Revert "Patch hipsparselt build system to use local tensilelite" This reverts commit dedf0a070911e739d3d049c059ba1e6903f4cea9. 
* [hipSPARSELt] Enable gfx1250 * [hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950 * remove deprecated variable: LdsAlignPow2 * TF32: remove deprecated code * Workaround for CMS to support separate LRVWA/B * [hipSPARSELt][Workaround] Correct wlrMultiple calculation for sparse WLR rejection For sparse A (Sparse==1), MIInputPerThreadA already accounts for the compressed data, but LocalReadVectorWidthA is based on the original (uncompressed) data size. Divide by an extra factor of 2 to correctly compute wlrMultiple for sparse A. Similarly for sparse B (Sparse==2), apply the same correction to wlrMultiple based on LocalReadVectorWidthB. Also clarify rejection messages to distinguish A/B tensor. * Fix: stuck in _makeSubIterSchedule() when build gfx1200 SIA2 kernel root cause: packItems is non-empty and it never check instPerPackB to pop items * Fix local read waitcnt * Revert "[Tensilelite][Sparse] Enable plr min for spmm (#4364)" This reverts commit cb81c2e. * Revert "[hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950" This reverts commit 8f32694cd27fc4554c82ac0fe8f4e4ca553e2510. * Fix lrvw undefined caused by conflicts resolving * Fix activation args type mismatch causing segfault in debug build prob.act0/act1 (float) were pushed into ConstantVariant without casting to the compute type. When compute type is Int32 (i8->i32 GEMM), subsequent std::get_if<int32_t> on a float-holding variant returned nullptr, causing SIGSEGV on dereference. Release builds masked this UB via -O3 optimizations. 
Affects: ./build/release/clients/hipblaslt-test --gtest_filter=*gemm_i8_dst_i32_94x* * Enable RocRoller for hipblaslt and fix host build issue * remove redundant pack code in xfp32 kernels * pick all fix from gfx1250 to develop-gfx1250-open-source (#969) * [hipblaslt] fix ds_bpermute_b32 msb computation and s_set_vgpr_msb (#910) * Add gfx1250 HHS AuxH yamls for gtest (#911) Co-authored-by: Andy Su <andysu12@amd.com> * fix ExtOpLayerNorm test (#935) enable extop for gfx11/12 (#129) Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> * [hipblaslt] fix index mode discarding valid algos when pool size < batch size (#960) getAlgosFromIndex returns INVALID_VALUE when any requested index exceeds the pool size, even though valid algos are still populated in the output. Previously the caller broke immediately on INVALID_VALUE, discarding those valid results. Now process the returned algos before exiting the loop. * [Tensilelite] modify NumRecords of E and BiasSrd for gfx1250 (#957) modify NumRecords of E and BiasSrd for gfx1250 * [hipblaslt] add correct num_records to BRD of gfx1250 (#958) --------- Co-authored-by: Chang, Josh <Josh.Chang@amd.com> Co-authored-by: Su, Andy <Andy.Su@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> * Init srdB when calclating KRingShift Fix hipblaslt-test cases: *bf16_rbf16_rbf16_rbf16_rf32_r_TN_128_128_2048* * skip gfx1250 for kringshift test * Fixed local write waitcnt calculation for f6 datatype * Fallback to narrowing buffer load (GRVW=1) if no partial OOB. (#991) Fix buffer load failure in tail loop. Fallback to narrowing buffer load (GRVW=1) if partial OOB is unsupported by hardware. 
* Add FP8/BF8 logic YAML & test coverage (#998) * Fix v_pk_mul_f32 SGPR operand error on gfx1250 Add op_sel_hi=[1,0,1] modifier to VMulPKF32 for ScaleD to properly broadcast scalar SGPR value on gfx1250 packed math instructions. * Add F8/B8 related logic YAML Add 80 GridBased logic YAML files to support FP8 and BF8 data types on gfx1250. This includes: - Multiple precision combinations: F8, B8, F8B8, B8F8 - Various output types: FP16, FP32, BF16, F8, B8 - All matrix layouts: NN, NT, TN, TT - Epilogue support: Bias, ScaleA/B, ScaleC/D * Enable F8/B8 test cases for gfx1250 in hipblaslt-test Update gpu_arch filters in matmul_gtest.yaml and smoke_gtest.yaml to include gfx1250, enabling the following test categories: "matmul_f8_bf8_dst_fp32" "matmul_f8_bf8_dst_bf16" "matmul_f8_dst_bf16" "matmul_f8_bf8_dst_f16" "matmul_f8_bf8_dst_fp32_gfx12" "matmul_f8_bf8_dst_fp16_gfx12" "matmul_f8_bf8_dst_bf16_gfx12" "matmul_real_1b_dst_f8_SCDInt1_gfx12" "matmul_real_1b_dst_f8_SCDNotInt_gfx12" "matmul_one_real_precisions_1b_gfx12" "matmul_f8_bf8_dst_fp16_gfx12_smoke" "matmul_f8_bf8_dst_bf16_gfx12_smoke" "matmul_real_1b_dst_f8_SCDInt1_gfx12_smoke" "matmul_real_1b_dst_f8_SCDNotInt_gfx12_smoke" * Update logic YAML for F8/B8 related 1x1x1 solutions Update Tensile logic YAML files across Ailk/Alik Bjlk/Bljk matrix layout combinations for FP8/BF8 data types (B8F8, F8B8, B8, F8 variants including HS, BS, SS subtypes). * Fix UseCustomMainLoopSchedule value type in F8/B8 logic YAMLs Change UseCustomMainLoopSchedule from boolean `false` to integer `0` across 80 gfx1250 GridBased logic YAML files for consistency with the expected integer type. 
* Changed `_UseSgprForGRO: false` to `_UseSgprForGRO: 0` * Fix uninitialized union members in f6/bf6x16 conversion helpers * Fix incorrect arg type of copy constructor for some ds_load inst * Move F4/F6 init out of Runner into sample-specific files * Add = default to Float6x16 default constructor * Remove unused iType param from MXMFMAInstruction::typeConvert * Refactor scaleA/B type dispatch to switch for -Wswitch coverage * Remove duplicated HasWMMA_f8f6f4 capability * Fix default cnt for SWaitTensorCnt to 0. * Remove datatype examples that are not currently supported. (#1017) Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT. * Reapply "[Tensilelite][Sparse] Enable plr min for spmm (#4364)" This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda. * [hipSPARSELt] Fix packItemsM logic for gfx1250 to be aligned with other architectures * [hipSPARSELt] Reuse single PackTemp VGPR for MIInputPerThUnroll==8 metadata packing In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends before the second group of packing operations begins, so PackTemp can be safely reused instead of requiring PackTemp+1. This aligns the code generation with the VGPR allocation logic which only reserves 1 VGPR for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8. * Fix incorrect function name of computeInputType{A, B} calls * Fix missing mxsb when rebasing * Revert modifications for emulator (#1001) Remove emulator parameter: ROCmAgentEnumeratorPath. Remove env parameters in tox.ini. Remove compile and emulator workarounds. Set default CpuThreads back to -1. * Enable HasF32XEmulation in gfx1250 * Fix XF32 LocalRead VGPR packing for gfx1250 WMMA V3 Fix incorrect results produced by XF32 emulation kernels on gfx1250. 
The original XF32 codegen was written against gfx950 MFMA assumptions that do not hold for gfx1250 WMMA V3: 1. LDS offsets used gfx950-specific hardcoded constants (4, 12). gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with BF16/Half. Branch by ISA via calcGfx1250LdsOffset(). 2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group). gfx1250 has vgprPerInput=16 (two groups), producing interleaved [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into contiguous [HI_all, LO_all] expected by 3-pass WMMA. 3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8. Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250. * Fix XF32 Direct32XEmulation pack/WMMA scheduling data hazard in SIA3 on gfx1250 SIA3 scheduler interleaved pack and MAC instructions without respecting data dependencies in the XF32 multigroup path, causing v_swap_b32 to corrupt F32 values mid-packing, and WMMAs to consume partially-packed VGPRs. - Move v_swap_b32 rearrangement from MAC code into pack code so it stays ordered after all TF32_1/TF32_2 packing - Fix destVgpr aliasing for UseDirect32XEmulation local reads - Place all XF32 pack items before the first WMMA instead of distributing one chunk per MFMA slot * Fix XF32 tail loop K-masking on gfx1250 WMMA V3 The tail loop K-masking logic was written against gfx950 MFMA geometry where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR). gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32 (1 element/VGPR), breaking two assumptions: 1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250 Direct32X allocates T0 at half capacity (8 slots for 16 elements), so raw bk overflows into wrong tensor's registers. Fix: adjustedBk = (bk // 8) * 4 + (bk % 4). 2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}. gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27. 
     Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K offsets per group.
* Disable ForceUnrollSubIter for F32X emulation
  F32X emulation pack code performs destructive in-place VGPR conversion (FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's sub-tiling that splits local reads and pack code across sub-iterations. This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4], and DepthU==MatrixInstK.
* Generalize MIInputPerThread for gfx1250 WMMA XF32
  Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused AssertionError on gfx1250 (MIInputPerThread=16). Parameterize TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and convArray to support any MIInputPerThread value.
* Disable UseMFMAF32XEmulation on WMMA-only ISAs (gfx1250)
  UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels, causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16 instructions. Gate the flag behind HasMFMA so WMMA architectures fall through to the cvt+sub path instead.
* Fix lrvwTile not forced to 1 for non-MFMA XF32 (gfx1250 WMMA)
  The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1 correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads. Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1.
* Fix TF32EmuInterleaveTreg local read index for non-prefetch path
  Problem: NT/TN format XF32 kernels produce inf/nan errors when TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False (PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch pack code layout, which is incompatible with the non-prefetch register layout.
  Fix: Add conditional branching based on doFullPackCodePrefetch in the TF32EmuInterleaveTreg handling.
  For the non-prefetch path (PLR=0), use a simpler index calculation that maps the first half of each group (withinGroup < 4) to T registers with a straightforward index formula: idx = (idx // 8) * 4 + withinGroup. This matches the register layout expected by the pack code when doFullPackCodePrefetch is False.
* Fix TF32 emulation T-register overlap in tail loop
  Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop.
  Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0, vgprValuB_T0_I0) were defined using symbolic references relative to vgprValu{A/B}_X0_I0_BASE:
      .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
  In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct). In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88). This causes A's TF32 processing to corrupt B's T registers (v88-v89), leading to incorrect WMMA results.
  Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative offsets in RegSet. This ensures T register addresses remain correct regardless of BASE redefinition in the tail loop.
* Enable XFP32 test coverage for gfx1250 in hipblaslt-test and tox
* Enable gradient, postprocessing, and fix CVT instructions for gfx1250. (#1042)
* Fix issues in cvt, enable gradient support for gfx1250.
  Cherry-picked from PR #160:
  - 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient)
  - 80cce534da (Enable bbs gradient and postprocessing)
* Fix wave32 and FP16 gradient issues on gfx1250
  - Fix BF16 NaN check in writeBiasToGlobal for wave32
  - Add fallback for FP16 sum unroll when dot2 is unavailable
  - Enable gfx1250 gtests for dgelu, bgrada, bgradb
  ---------
  Co-authored-by: George Tseng <george.tseng@amd.com>
  Co-authored-by: Andy Su <andysu12@amd.com>
* Remove unused segmentsize (#1049)
* include fp6 bf6 header
* Don't use hip f6x16 until it is ready
* Fix: Python 3.10 doesn't support [] inside f-string expressions
* clear wmma_v2 flag if wmma_v3 is detected
* only set new msb value until the compiler supports it
* added int and uint32_t constructors to three packed floating-point types
* revert the tests for backward path for gfx1250
* [tensilelite] skip 120x for gfx1250 tox tests
* support mix mode in test_CustomSchedule.py
* Add workaround for size mismatch
* Extend size mismatch workaround to FP4 types
* [hipSPARSELt] Reapply: Make extops and matrix-transform subdirectories conditional
  Guard add_subdirectory(extops) and add_subdirectory(matrix-transform) with HIPBLASLT_ENABLE_EXTOPS and HIPBLASLT_ENABLE_MATRIX_TRANSFORM flags respectively, allowing builds to opt out of these components.
* Revert "[hipSPARSELt] Patch reduced-size smoke tests"
  This reverts commit 462cc81.
* 1. Guard packed-type getElement overload with #if to fix HIP device compile error
  2. Fix -Wswitch warnings for unhandled Float6/BFloat6/Float4 enum values
  3. Fix typo in BadInput/BadOutput error messages for Float6x16 and BFloat6x16
  4. Add WIN32 macro
* Remove NB_SHARED from rocisa to fix Windows DLL import failure
* Fix MX on Windows

---------

Co-authored-by: Stacey Lai <stacey.lai@amd.com>
Co-authored-by: Lin, Ed <Ed.Lin@amd.com>
Co-authored-by: Wu, Brianna <Brianna.Wu@amd.com>
Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: yu-hsieh <Yu-cheng.Hsieh@amd.com>
Co-authored-by: Vin Huang <vin.huang@amd.com>
Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: marhuang_amdeng <marhuang@amd.com>
Co-authored-by: Lu, Serge <Serge.Lu@amd.com>
Co-authored-by: Yang, Anne <Anne.Yang@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Stacey Lai <stacelai@amd.com>
Co-authored-by: mengzcai_amdeng <Meng-Zhe.Cai@amd.com>
Co-authored-by: jichang <jimmy.chang@amd.com>
Co-authored-by: Ho, Henry <Henry.Ho@amd.com>
Co-authored-by: Chang, Josh <Josh.Chang@amd.com>
Co-authored-by: Su, Andy <Andy.Su@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
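
Two of the fixes in the commit message above come down to register-index arithmetic that is easy to check by hand: the T-register overlap in the tail loop, and the tail-loop slot remapping adjustedBk = (bk // 8) * 4 + (bk % 4). The Python sketch below only replays the numbers quoted in the message and is not taken from the Tensile sources. The function names are hypothetical, and the 8-VGPR span for A's T registers (v82-v89) is an assumption inferred from the statement that v88-v89 were corrupted.

```python
# Illustrative arithmetic only; names here are hypothetical, not Tensile APIs.

T0_OFFSET_FROM_BASE = 56  # from: .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
A_T0 = 82                 # vgprValuA_T0_I0, as quoted in the commit message
A_T_SPAN = 8              # ASSUMPTION: A's T registers cover v82..v89


def b_t0_symbolic(base: int) -> int:
    """B's T0 register when it is defined relative to the (mutable) BASE."""
    return base + T0_OFFSET_FROM_BASE


def overlaps_a(b_t0: int) -> bool:
    """True if B's T0 falls inside A's assumed T-register range."""
    return A_T0 <= b_t0 < A_T0 + A_T_SPAN


def adjusted_bk(bk: int) -> int:
    """Tail-loop T0 slot remapping quoted in the commit message."""
    return (bk // 8) * 4 + (bk % 4)


if __name__ == "__main__":
    main_loop_t0 = b_t0_symbolic(34)  # BASE in the main loop
    tail_loop_t0 = b_t0_symbolic(32)  # BASE after redefinition in the tail loop
    print("main loop:", main_loop_t0, "overlaps A:", overlaps_a(main_loop_t0))
    print("tail loop:", tail_loop_t0, "overlaps A:", overlaps_a(tail_loop_t0))

    # An absolute address captured once does not move when BASE is redefined.
    absolute_t0 = main_loop_t0
    print("absolute:", absolute_t0, "overlaps A:", overlaps_a(absolute_t0))

    # The remapping keeps every bk in 0..15 within the 8 allocated T0 slots.
    print("max slot:", max(adjusted_bk(bk) for bk in range(16)))
```

With the symbolic definition, lowering BASE from 34 to 32 moves B's T0 from 90 to 88, which lands inside A's assumed range; an absolute address is unaffected by the redefinition, which is the property the fix relies on.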
No description provided.