Issue Template issue #51 #73

Closed
ammallya wants to merge 7 commits into develop from ammallya-patch-1
@ammallya
Collaborator

No description provided.

Comment thread .github/ISSUE_TEMPLATE/issue_report.yml Outdated
ammallya pushed a commit that referenced this pull request Sep 24, 2025
* Add engine heuristic descriptor

* Add setting the heuristic mode

* format

* Add initial tests for heuristic descriptor

* Fixing test breaks, and defects

* Add additional test cases

* Add remaining API tests and fix breaks

* Add finalization changes to graph_descriptor, and update associated tests. Also code review comments.

* Format missing file

* Add retrieving the heuristic mode

* Resolve merge conflicts, and new memory leaks

* Fix casting

* Fix cast warnings

* Formatting

* Remove duplicated comment

* Swap usage of reinterpret cast for void* usage

* Swap backend descriptors to be wrappers that store shared references

* Bring tests back for merge

* merge fixes and disable tests again

* Fix some memory tracking issues, and fix test

* Fix more tests, and add utilities to improve readability / usage of the new wrappers.

* Update tests, fix memory leaks, and resolve breaks

* formatting

* Re-enable graph tests

* Clean-up naming, namespaces, and tests

* Add tests for new getters

* Apply formatting

* Swap to pack utility method

* Remove pointless local

* Fix build breaks, and implement a bit of execute

* Re-enable variant pack tests

* Formatting

[ROCm/hipDNN commit: 5ec25f0]
stanleytsang-amd pushed a commit that referenced this pull request Dec 12, 2025
## Motivation

Enable gfx1152 and gfx1153.

## Technical Details

1. combine arrays into tables and use local macros to reduce repetition
(for maintainability)
2. monkey-see-monkey-do wherever `gfx11...` was found
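
The table-plus-local-macro approach in item 1 can be sketched roughly as below. This is a hypothetical illustration, not the code from this commit: the macro names, the enum, the parser, and the exact list of targets are all assumptions made for the example.

```cpp
#include <string_view>

// Hypothetical target table; the entries and every name here are illustrative only.
#define EXAMPLE_GFX11_TARGETS(X) \
    X(gfx1150)                   \
    X(gfx1151)                   \
    X(gfx1152)                   \
    X(gfx1153)

// The enum is generated from the table, so a new target is added in one place.
enum class example_target
{
#define EXAMPLE_ENUM_ENTRY(name) name,
    EXAMPLE_GFX11_TARGETS(EXAMPLE_ENUM_ENTRY)
#undef EXAMPLE_ENUM_ENTRY
    unknown
};

// The name lookup is generated from the same table with a second local macro.
constexpr example_target example_parse_target(std::string_view arch)
{
#define EXAMPLE_MATCH(name) \
    if(arch == #name)       \
        return example_target::name;
    EXAMPLE_GFX11_TARGETS(EXAMPLE_MATCH)
#undef EXAMPLE_MATCH
    return example_target::unknown;
}
```

Each helper macro is defined immediately before use and undefined right after, which keeps it local to the table expansion.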

## Test Plan

Build existing ctests for, and run them on, gfx1152 and gfx1153.

## Test Result

<details>
<summary>gfx1152 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 2 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
      Start  2: hip.async_copy
 1/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 2/73 Test  #1: hip.device_api ..............................   Passed    0.02 sec
      Start  4: rocprim.internal_merge_path
 3/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 4/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  9: rocprim.block_discontinuity
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.34 sec
      Start 10: rocprim.bit_cast
 9/73 Test #10: rocprim.bit_cast ............................   Passed    0.02 sec
      Start 11: rocprim.block_exchange
10/73 Test #11: rocprim.block_exchange ......................   Passed    0.73 sec
      Start 12: rocprim.block_histogram
11/73 Test #12: rocprim.block_histogram .....................   Passed    0.54 sec
      Start 13: rocprim.block_load_store
12/73 Test #13: rocprim.block_load_store ....................   Passed    0.44 sec
      Start 14: rocprim.block_sort_merge
13/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
14/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
15/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.03 sec
      Start 17: rocprim.block_radix_sort
16/73 Test #17: rocprim.block_radix_sort ....................   Passed    4.79 sec
      Start 18: rocprim.block_reduce
17/73 Test #18: rocprim.block_reduce ........................   Passed    0.26 sec
      Start 19: rocprim.block_run_length_decode
18/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.54 sec
      Start 20: rocprim.block_scan
19/73 Test #20: rocprim.block_scan ..........................   Passed    0.04 sec
      Start 21: rocprim.block_shuffle
20/73 Test #21: rocprim.block_shuffle .......................   Passed    2.70 sec
      Start 22: rocprim.block_sort_bitonic
21/73 Test  #9: rocprim.block_discontinuity .................   Passed   17.36 sec
      Start 23: rocprim.config_dispatch
22/73 Test #23: rocprim.config_dispatch .....................   Passed    0.09 sec
      Start 24: rocprim.constant_iterator
23/73 Test #24: rocprim.constant_iterator ...................   Passed    0.07 sec
      Start 25: rocprim.counting_iterator
24/73 Test #25: rocprim.counting_iterator ...................   Passed    0.07 sec
      Start 26: rocprim.device_batch_memcpy
25/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.20 sec
      Start 27: rocprim.device_binary_search
26/73 Test #27: rocprim.device_binary_search ................   Passed    0.02 sec
      Start 28: rocprim.device_adjacent_difference
27/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
28/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
29/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
30/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   13.23 sec
      Start 32: rocprim.device_merge
31/73 Test #31: rocprim.device_histogram ....................   Passed    7.85 sec
      Start 33: rocprim.nth_element
32/73 Test #33: rocprim.nth_element .........................   Passed    0.03 sec
      Start 34: rocprim.device_partial_sort
33/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
34/73 Test #35: rocprim.device_reduce .......................   Passed    8.94 sec
      Start 36: rocprim.device_run_length_encode
35/73 Test #32: rocprim.device_merge ........................   Passed   14.05 sec
      Start 37: rocprim.device_search
36/73 Test #37: rocprim.device_search .......................   Passed    0.02 sec
      Start 38: rocprim.device_segmented_radix_sort
37/73 Test #36: rocprim.device_run_length_encode ............   Passed   13.92 sec
      Start 39: rocprim.device_search_n
38/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
39/73 Test #40: rocprim.device_segmented_reduce .............   Passed    5.21 sec
      Start 41: rocprim.device_segmented_scan
40/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
41/73 Test #42: rocprim.device_transform ....................   Passed   13.90 sec
      Start 43: rocprim.discard_iterator
42/73 Test #43: rocprim.discard_iterator ....................   Passed    0.07 sec
      Start 44: rocprim.radix_key_codec
43/73 Test #44: rocprim.radix_key_codec .....................   Passed
      Start 45: rocprim.predicate_iterator
44/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.07 sec
      Start 46: rocprim.reverse_iterator
45/73 Test #46: rocprim.reverse_iterator ....................   Passed    0.09 sec
      Start 47: rocprim.rocprim_tuple
46/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
47/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
48/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
49/73 Test #50: rocprim.thread ..............................   Passed    0.07 sec
      Start 51: rocprim.thread_algos
50/73 Test #51: rocprim.thread_algos ........................   Passed    0.35 sec
      Start 52: rocprim.tuple
51/73 Test #52: rocprim.tuple ...............................   Passed    0.02 sec
      Start 53: rocprim.utils_sort_checker
52/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
53/73 Test #54: rocprim.transform_iterator ..................   Passed    0.11 sec
      Start 55: rocprim.type_traits_interface_cpp17
54/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
55/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
56/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
57/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
58/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
59/73 Test #60: rocprim.intrinsics ..........................   Passed    0.21 sec
      Start 61: rocprim.intrinsics_atomic
60/73 Test #61: rocprim.intrinsics_atomic ...................   Passed    0.02 sec
      Start 62: rocprim.invoke_result
61/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
62/73 Test #63: rocprim.warp_exchange .......................   Passed    0.08 sec
      Start 64: rocprim.warp_load
63/73 Test #64: rocprim.warp_load ...........................   Passed    0.08 sec
      Start 65: rocprim.warp_reduce
64/73 Test #65: rocprim.warp_reduce .........................   Passed    0.14 sec
      Start 66: rocprim.warp_scan
65/73 Test #66: rocprim.warp_scan ...........................   Passed    0.20 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
66/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.21 sec
      Start 68: rocprim.warp_sort
67/73 Test #68: rocprim.warp_sort ...........................   Passed    0.09 sec
      Start 69: rocprim.warp_store
68/73 Test #69: rocprim.warp_store ..........................   Passed    0.02 sec
      Start 70: rocprim.zip_iterator
69/73 Test #70: rocprim.zip_iterator ........................   Passed    0.02 sec
      Start 71: rocprim.accumulator_t
70/73 Test #71: rocprim.accumulator_t .......................   Passed    0.02 sec
      Start 72: hipgraph.basic
71/73 Test #72: hipgraph.basic ..............................   Passed    0.02 sec
      Start 73: hipgraph.algs
72/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec
73/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   31.97 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) =  71.80 sec
✅ test_rocprim.py PASSED
```

</details>



<details>
<summary>gfx1153 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 1 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
 1/73 Test  #1: hip.device_api ..............................   Passed    0.01 sec
      Start  2: hip.async_copy
 2/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 3/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  4: rocprim.internal_merge_path
 4/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.94 sec
      Start  9: rocprim.block_discontinuity
 9/73 Test  #9: rocprim.block_discontinuity .................   Passed   21.10 sec
      Start 10: rocprim.bit_cast
10/73 Test #10: rocprim.bit_cast ............................   Passed    0.01 sec
      Start 11: rocprim.block_exchange
11/73 Test #11: rocprim.block_exchange ......................   Passed    2.20 sec
      Start 12: rocprim.block_histogram
12/73 Test #12: rocprim.block_histogram .....................   Passed    0.71 sec
      Start 13: rocprim.block_load_store
13/73 Test #13: rocprim.block_load_store ....................   Passed    0.48 sec
      Start 14: rocprim.block_sort_merge
14/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
15/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
16/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.02 sec
      Start 17: rocprim.block_radix_sort
17/73 Test #17: rocprim.block_radix_sort ....................   Passed    6.12 sec
      Start 18: rocprim.block_reduce
18/73 Test #18: rocprim.block_reduce ........................   Passed    0.31 sec
      Start 19: rocprim.block_run_length_decode
19/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.68 sec
      Start 20: rocprim.block_scan
20/73 Test #20: rocprim.block_scan ..........................   Passed    0.03 sec
      Start 21: rocprim.block_shuffle
21/73 Test #21: rocprim.block_shuffle .......................   Passed    3.63 sec
      Start 22: rocprim.block_sort_bitonic
22/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   19.34 sec
      Start 23: rocprim.config_dispatch
23/73 Test #23: rocprim.config_dispatch .....................   Passed    0.10 sec
      Start 24: rocprim.constant_iterator
24/73 Test #24: rocprim.constant_iterator ...................   Passed    0.09 sec
      Start 25: rocprim.counting_iterator
25/73 Test #25: rocprim.counting_iterator ...................   Passed    0.09 sec
      Start 26: rocprim.device_batch_memcpy
26/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.42 sec
      Start 27: rocprim.device_binary_search
27/73 Test #27: rocprim.device_binary_search ................   Passed    0.01 sec
      Start 28: rocprim.device_adjacent_difference
28/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
29/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
30/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
31/73 Test #31: rocprim.device_histogram ....................   Passed    8.77 sec
      Start 32: rocprim.device_merge
32/73 Test #32: rocprim.device_merge ........................   Passed   16.23 sec
      Start 33: rocprim.nth_element
33/73 Test #33: rocprim.nth_element .........................   Passed    0.01 sec
      Start 34: rocprim.device_partial_sort
34/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
35/73 Test #35: rocprim.device_reduce .......................   Passed   10.92 sec
      Start 36: rocprim.device_run_length_encode
36/73 Test #36: rocprim.device_run_length_encode ............   Passed   14.34 sec
      Start 37: rocprim.device_search
37/73 Test #37: rocprim.device_search .......................   Passed    0.01 sec
      Start 38: rocprim.device_segmented_radix_sort
38/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   38.28 sec
      Start 39: rocprim.device_search_n
39/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
40/73 Test #40: rocprim.device_segmented_reduce .............   Passed    7.19 sec
      Start 41: rocprim.device_segmented_scan
41/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
42/73 Test #42: rocprim.device_transform ....................   Passed   17.64 sec
      Start 43: rocprim.discard_iterator
43/73 Test #43: rocprim.discard_iterator ....................   Passed    0.12 sec
      Start 44: rocprim.radix_key_codec
44/73 Test #44: rocprim.radix_key_codec .....................   Passed    0.01 sec
      Start 45: rocprim.predicate_iterator
45/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.08 sec
      Start 46: rocprim.reverse_iterator
46/73 Test #46: rocprim.reverse_iterator ....................   Passed
      Start 47: rocprim.rocprim_tuple
47/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
48/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
49/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
50/73 Test #50: rocprim.thread ..............................   Passed    0.08 sec
      Start 51: rocprim.thread_algos
51/73 Test #51: rocprim.thread_algos ........................   Passed    0.43 sec
      Start 52: rocprim.tuple
52/73 Test #52: rocprim.tuple ...............................   Passed    0.01 sec
      Start 53: rocprim.utils_sort_checker
53/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
54/73 Test #54: rocprim.transform_iterator ..................   Passed    0.12 sec
      Start 55: rocprim.type_traits_interface_cpp17
55/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
56/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
57/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
58/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
59/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
60/73 Test #60: rocprim.intrinsics ..........................   Passed    0.29 sec
      Start 61: rocprim.intrinsics_atomic
61/73 Test #61: rocprim.intrinsics_atomic ...................   Passed
      Start 62: rocprim.invoke_result
62/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
63/73 Test #63: rocprim.warp_exchange .......................   Passed    0.09 sec
      Start 64: rocprim.warp_load
64/73 Test #64: rocprim.warp_load ...........................   Passed    0.09 sec
      Start 65: rocprim.warp_reduce
65/73 Test #65: rocprim.warp_reduce .........................   Passed    0.17 sec
      Start 66: rocprim.warp_scan
66/73 Test #66: rocprim.warp_scan ...........................   Passed    0.26 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
67/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.26 sec
      Start 68: rocprim.warp_sort
68/73 Test #68: rocprim.warp_sort ...........................   Passed    0.10 sec
      Start 69: rocprim.warp_store
69/73 Test #69: rocprim.warp_store ..........................   Passed    0.01 sec
      Start 70: rocprim.zip_iterator
70/73 Test #70: rocprim.zip_iterator ........................   Passed    0.01 sec
      Start 71: rocprim.accumulator_t
71/73 Test #71: rocprim.accumulator_t .......................   Passed    0.01 sec
      Start 72: hipgraph.basic
72/73 Test #72: hipgraph.basic ..............................   Passed    0.01 sec
      Start 73: hipgraph.algs
73/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) = 175.34 sec
✅ test_rocprim.py PASSED
```

</details>


## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
evetsso pushed a commit to evetsso/rocm-libraries that referenced this pull request Dec 31, 2025
…ck_tile/gfx1250-wmma-impl

Revert "[CK_TILE][GFX1250] WMMA GEMM Support on GFX1250 - Part I"
@ammallya closed this Mar 13, 2026
Jeff-Huang added a commit that referenced this pull request Apr 23, 2026
Address all 5 reviewer asks on the >2GB KV cache batch-prefill series, plus
two self-found polish items surfaced by an internal CK-aware review pass.

Task #71 — bool kUseGlobalLoad_ -> BlockAttentionKVCacheLoadModeEnum kKVLoadMode_
(poyenc, tile_fmha_traits.hpp:62):
  Adjacent traits-template params (kKVMemoryLayout_, kKVLookupTable_) are
  already BlockAttention*Enum types; the binary kUseGlobalLoad_ stuck out
  as a bool exception. Convert to a 2-value enum {BUFFER_LOAD = 0,
  GLOBAL_LOAD_LDS = 1} living in a new ops/fmha/block/ header so it sits
  alongside its siblings. Touch sites:
  * include/ck_tile/ops/fmha/block/block_attention_kv_load_mode_enum.hpp
    (NEW): the enum class.
  * include/ck_tile/ops/fmha/pipeline/tile_fmha_traits.hpp: rename last
    template param + static member alias.
  * include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_problem.hpp:
    mirror alias rename.
  * include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp:
    add enum header include; class declares static auto kKVLoadMode plus
    derived static bool kUseGlobalLoad = (kKVLoadMode == GLOBAL_LOAD_LDS).
    All 10 internal `if constexpr(kUseGlobalLoad)` sites unchanged so the
    bool boundary is local to one TU. The standalone helper
    kv_offset_array_transform keeps its bool template param (private
    inline; intentional — keeps core/ tile_scatter_gather.hpp out of the
    enum's blast radius).
  * example/ck_tile/01_fmha/fmha_fwd.hpp: fmha_fwd_batch_prefill_traits_
    last template param renamed; static member alias kUseGlobalLoad ->
    kKVLoadMode (default BUFFER_LOAD).
  * include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp:
    comment-only update.
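
A minimal sketch of the bool-to-enum conversion described for Task #71. The enum name and its two values come from the text above; `ExampleTraits` and `ExamplePipeline` are hypothetical stand-ins for the real traits and pipeline templates, which take many more parameters.

```cpp
// Enum name and values as given in the commit message.
enum class BlockAttentionKVCacheLoadModeEnum
{
    BUFFER_LOAD     = 0,
    GLOBAL_LOAD_LDS = 1,
};

// Hypothetical, reduced traits type: only the renamed last parameter is shown.
template <BlockAttentionKVCacheLoadModeEnum kKVLoadMode_ =
              BlockAttentionKVCacheLoadModeEnum::BUFFER_LOAD>
struct ExampleTraits
{
    static constexpr auto kKVLoadMode = kKVLoadMode_;
};

// Hypothetical pipeline: the enum is narrowed to a bool exactly once, so
// existing `if constexpr(kUseGlobalLoad)` branches can stay as they are.
template <typename Traits>
struct ExamplePipeline
{
    static constexpr auto kKVLoadMode = Traits::kKVLoadMode;
    static constexpr bool kUseGlobalLoad =
        (kKVLoadMode == BlockAttentionKVCacheLoadModeEnum::GLOBAL_LOAD_LDS);

    // Returns 1 on the global-load path and 0 on the buffer-load path.
    static constexpr int selected_path()
    {
        if constexpr(kUseGlobalLoad)
            return 1;
        else
            return 0;
    }
};
```

Deriving the bool inside the pipeline class is what keeps the enum-to-bool boundary in a single place.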

Task #70 — explicit constructor mem-init for tile_scatter_gather
(asleepzzz, tile_scatter_gather.hpp:1241, comment #3125912056):
  physical_pages_ and page_stride_elements_ were silently zero-initialized
  in the BUFFER_LOAD arm. Today safe (Task #71's positive setter asserts
  prevent misuse), but a future kUseGlobalLoad=true caller that misses a
  setter would get silent data corruption with no compile error. Make
  both fields explicit in the mem-init list so the contract is visible
  at the constructor boundary.
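
The constructor change for Task #70 might look roughly like the following. Only the two field names `physical_pages_` and `page_stride_elements_` come from the text; the struct, its other member, and the field types are assumptions for illustration.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-in for tile_scatter_gather.
struct ExampleScatterGather
{
    // Both page-related fields are named in the mem-init list, so their
    // "unset" state is visible at the constructor instead of being implied
    // by silent zero-initialization.
    constexpr explicit ExampleScatterGather(const float* base)
        : base_(base), physical_pages_(nullptr), page_stride_elements_(0)
    {
    }

    const float* base_;
    const std::int32_t* physical_pages_; // assumed type
    std::int64_t page_stride_elements_;  // assumed type
};
```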

Task #72 — extract dispatcher overflow predicate to a named helper
(poyenc, fmha_batch_prefill.py:225):
  Move the (page_block_size < kN0 && kv_pool_bytes > INT32_MAX) decision
  out of the codegen template into a free helper:
    fmha_batch_prefill_select_kv_load_mode(page_block_size, kN0,
                                           num_total_pages, batch_stride_k,
                                           element_bytes)
  in example/ck_tile/01_fmha/fmha_fwd.hpp. The codegen-emitted dispatcher
  arms now call it with their compile-time kN0/element_bytes substituted,
  so the formula has exactly one source of truth.
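
A sketch of what such a helper could look like, under stated assumptions: the parameter list and the predicate `page_block_size < kN0 && kv_pool_bytes > INT32_MAX` come from the text above, while the way `kv_pool_bytes` is computed (pages × stride × element size), the parameter types, the return type, and the `example_` names are guesses made for the example.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the load-mode enum.
enum class ExampleKVLoadMode
{
    BUFFER_LOAD     = 0,
    GLOBAL_LOAD_LDS = 1,
};

// Assumed formula: kv_pool_bytes = num_total_pages * batch_stride_k * element_bytes.
constexpr ExampleKVLoadMode example_select_kv_load_mode(std::int32_t page_block_size,
                                                        std::int32_t kN0,
                                                        std::int32_t num_total_pages,
                                                        std::int32_t batch_stride_k,
                                                        std::int32_t element_bytes)
{
    // Every operand is widened before multiplying, so the product cannot wrap in 32 bits.
    const std::int64_t kv_pool_bytes = static_cast<std::int64_t>(num_total_pages) *
                                       static_cast<std::int64_t>(batch_stride_k) *
                                       static_cast<std::int64_t>(element_bytes);

    const bool needs_global_load =
        page_block_size < kN0 &&
        kv_pool_bytes > std::numeric_limits<std::int32_t>::max();

    return needs_global_load ? ExampleKVLoadMode::GLOBAL_LOAD_LDS
                             : ExampleKVLoadMode::BUFFER_LOAD;
}
```

With a single helper, each dispatcher arm only has to pass its own compile-time `kN0` and element size.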

Task #73 — symmetric gload/bload kernel-name suffix
(poyenc, fmha_batch_prefill.py:282):
  Match the existing CK convention (e.g., causal/ncausal, sink/nsink) by
  emitting a non-empty token in BOTH branches: '-gload' / '-bload' on
  FmhaFwdApiTrait.name, 'gload_' / 'bload_' on FmhaFwdKernel.name. The
  prior blank-default made it impossible to tell, when grepping JIT
  blob/ output 6 months later, whether a missing marker meant
  'BUFFER_LOAD variant' or 'old codegen revision before the gload branch
  existed'.
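
The naming rule for Task #73 can be illustrated as follows. The real change lives in the Python codegen script; this sketch restates the same rule in C++, and the enum and function names are hypothetical. Only the four tokens are taken from the text.

```cpp
#include <string_view>

// Hypothetical stand-in for the load-mode enum.
enum class ExampleKVLoadMode
{
    BUFFER_LOAD     = 0,
    GLOBAL_LOAD_LDS = 1,
};

// Token appended to the API trait name: never empty, so the variant is
// always identifiable from the name alone.
constexpr std::string_view example_trait_suffix(ExampleKVLoadMode mode)
{
    return mode == ExampleKVLoadMode::GLOBAL_LOAD_LDS ? "-gload" : "-bload";
}

// Token used in the kernel name, following the same rule.
constexpr std::string_view example_kernel_token(ExampleKVLoadMode mode)
{
    return mode == ExampleKVLoadMode::GLOBAL_LOAD_LDS ? "gload_" : "bload_";
}
```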

Task #74 — replace single-use dependent-false with reusable always_false_v
(poyenc, amd_buffer_addressing_builtins.hpp:1324):
  Promote impl::global_load_lds_arch_unreachable_v from a file-local
  helper into a generic ck_tile::always_false_v utility in
  core/utility/type_traits.hpp. Use it at the original site. The
  variable-template form defers evaluation to instantiation time, so a
  bare `static_assert(false, ...)` would (per CWG-2518 / current Clang)
  fire at parse time and break the whole TU even on never-instantiated
  arches.
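
The dependent-false idiom behind Task #74 can be sketched as below. The actual signature of `ck_tile::always_false_v` is not shown in this message, so `example_always_false_v`, the two arch tag types, and `example_load_bytes` are hypothetical names; C++17 is assumed.

```cpp
#include <type_traits>

// Always false, but dependent on its template arguments, so a static_assert
// that uses it is only evaluated when the enclosing template is instantiated.
template <typename...>
inline constexpr bool example_always_false_v = false;

struct example_arch_supported {};
struct example_arch_unsupported {};

template <typename Arch>
constexpr int example_load_bytes()
{
    if constexpr(std::is_same_v<Arch, example_arch_supported>)
    {
        return 16;
    }
    else
    {
        // Fires only if this branch is instantiated for an unsupported arch.
        static_assert(example_always_false_v<Arch>,
                      "global_load_lds is not available on this arch");
    }
}
```

Instantiating `example_load_bytes<example_arch_supported>()` compiles and yields 16, while `example_load_bytes<example_arch_unsupported>()` is rejected with the message above.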

Polish I-1 — umbrella header completeness:
  include/ck_tile/ops/fmha.hpp now pulls in the new
  block_attention_kv_load_mode_enum.hpp alongside the other
  BlockAttention*Enum siblings. Without this, downstream consumers that
  rely solely on the umbrella header would miss the enum.

Polish I-2 — overflow-cast robustness in fmha_batch_prefill_select_kv_load_mode:
  Promote every operand of the kv_pool_bytes multiplication to
  long_index_t individually instead of relying on left-to-right
  associativity to widen the chain. A future operand reorder would
  silently truncate; the per-operand cast makes overflow impossible
  regardless of order.
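
The hazard behind Polish I-2 can be shown with a small sketch. Unsigned 32-bit operands are used here only so that the wraparound is well defined; the real code uses its own index types, and both function names are hypothetical.

```cpp
#include <cstdint>

// Only the last operand is widened: a * b is evaluated in 32 bits and wraps
// before the 64-bit multiplication happens.
constexpr std::uint64_t example_bytes_chain(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    return a * b * static_cast<std::uint64_t>(c);
}

// Every operand is widened, so the result does not depend on operand order.
constexpr std::uint64_t example_bytes_per_operand(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    return static_cast<std::uint64_t>(a) * static_cast<std::uint64_t>(b) *
           static_cast<std::uint64_t>(c);
}
```

With `a = b = 65536` and `c = 2`, and assuming a 32-bit `unsigned int`, the chained form yields 0 because `a * b` wraps to 0, while the per-operand form yields 8589934592.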

Verified on smci355-gfx950 (gfx950): clean JIT rebuild succeeds; full
op_tests/test_batch_prefill.py sweep passes 30,720 / 30,720 (10,016
skipped, 0 failed) in 30:40 wall. Codegen identifier changes only affect
the renamed template parameter; no register-allocation perturbation
expected on either gfx942 or gfx950 (confirmed by the cross-arch sweep).