Skip to content

[rocBLAS] [hipBLAS] CHANGELOG update for 7.2#2668

Merged
amcamd merged 9 commits into
developfrom
users/amcamd/CHANGELOG_7.2
Nov 14, 2025
Merged

[rocBLAS] [hipBLAS] CHANGELOG update for 7.2#2668
amcamd merged 9 commits into
developfrom
users/amcamd/CHANGELOG_7.2

Conversation

@amcamd
Copy link
Copy Markdown
Contributor

@amcamd amcamd commented Nov 13, 2025

Motivation

Update CHANGELOG.md for 7.2.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #2668       +/-   ##
============================================
+ Coverage    55.20%   73.95%   +18.75%     
============================================
  Files           63      365      +302     
  Lines         7553    72116    +64563     
  Branches       824     6306     +5482     
============================================
+ Hits          4169    53329    +49160     
- Misses        3013    15028    +12015     
- Partials       371     3759     +3388     
Flag Coverage Δ
hipBLAS 90.64% <ø> (?)
rocBLAS 67.07% <ø> (?)
rocRAND ?

Flags with carried forward coverage won't be shown. Click here to find out more.
see 428 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@TorreZuk TorreZuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@amd-jnovotny amd-jnovotny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a handful of minor comments and suggestions.

Comment thread projects/hipblas/CHANGELOG.md Outdated
Comment thread projects/hipblas/CHANGELOG.md Outdated
Comment thread projects/rocblas/CHANGELOG.md Outdated
Comment thread projects/rocblas/CHANGELOG.md Outdated
Comment thread projects/rocblas/CHANGELOG.md Outdated
Comment thread projects/hipblas/CHANGELOG.md Outdated
Comment thread projects/rocblas/CHANGELOG.md Outdated
amcamd and others added 7 commits November 13, 2025 16:51
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Copy link
Copy Markdown
Contributor

@amd-jnovotny amd-jnovotny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@amcamd amcamd merged commit 7952129 into develop Nov 14, 2025
11 of 18 checks passed
@amcamd amcamd deleted the users/amcamd/CHANGELOG_7.2 branch November 14, 2025 17:39
assistant-librarian Bot pushed a commit to ROCm/hipBLAS that referenced this pull request Nov 14, 2025
[rocBLAS] [hipBLAS] CHANGELOG update for 7.2

## Motivation

Update CHANGELOG.md for 7.2.
assistant-librarian Bot pushed a commit to ROCm/rocBLAS that referenced this pull request Nov 14, 2025
[rocBLAS] [hipBLAS] CHANGELOG update for 7.2

## Motivation

Update CHANGELOG.md for 7.2.
amcamd added a commit that referenced this pull request Nov 21, 2025
## Motivation

Update CHANGELOG.md for 7.2.

---------

Co-authored-by: Andrew Chapman <anchapman@ctr2-alola-login-02.amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Andrew Chapman <anchapman@ctr2-alola-ctrl-01.amd.com>
jharryma pushed a commit that referenced this pull request Dec 1, 2025
## Motivation

cherry-pick update CHANGELOG.md for 7.2.

Co-authored-by: Andrew Chapman <anchapman@ctr2-alola-login-02.amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Andrew Chapman <anchapman@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: Val Movsik <160653499+vamovsik@users.noreply.github.com>
ammallya pushed a commit that referenced this pull request Feb 3, 2026
[ROCm/composable_kernel commit: c0c2ded]
ammallya pushed a commit that referenced this pull request Feb 3, 2026
* Add stride validation to prevent segfault in blockscale GEMM

* run clang-format

* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* added stride length checking to more gemm examples in ckprofiler

* ran clang format

* added validation header and implement in core gemm operations

* remove ck_tile transpose and gemm stages from CI (#2646)

* update CK build instruction step 4 (#2563)

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

* Fixes to  "General 2D Reduction Kernel" (#2535) (#2656)

* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format

* enable aiter test_mha in daily CI (#2659)

* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)

* feat(copy_kernel): add basic copy kernel example with documentation

* docs(CHANGELOG): Updated changelog

* chore: performed clang format

* Update example/ck_tile/39_copy/copy_basic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fix(terminology): follow amd terms

* extract elementwise copy to a new kernel

* fix(copy_kernel): bug in verification

* add comments about vgpr usage

* lint and nits

* add notes and comments

* print hostTensor via stream

* print hostTensor via stream

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)

* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

* Fix num_byte calculations to use nhead_k for K & V size (#2653)

Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.

Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```

After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```

* [CK_TILE] FMHA BWD Decode Pipeline (#2643)

* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2

* fix (#2668)

* Optimize fmha fwd decode & prefill for gfx950 (#2641)

* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)

This reverts commit 327bf40.

* added batch stride checking to batched gemm ops in profiler

* removed batch stride validation

* removed batched stride validation again

* Update include/ck/library/utility/profiler_validation_common.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* refactor function names

* added gemm stride checking to more profiler gemm operations

* run clang format

* add stride checkign to 01 gemm example

* rename from profiler to validation common, used for examples and profiler

* build of ckProfiler success

* update file headers

---------

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com>
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 60320e9]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants