Enable offload compress for rocthrust #2463

Merged
stanleytsang-amd merged 3 commits into develop from users/stanleytsang-amd/rocthrust-offload-compress
Nov 7, 2025
Conversation

@stanleytsang-amd
Contributor

Motivation

When compiling for a sufficiently large number of GPU architectures, the fat binary can become so large that linker errors occur. Enabling --offload-compress reduces the binary size, which avoids this problem.

Technical Details

Enables offload compression (--offload-compress) by default. It can still be turned off manually through CMake.
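
As a rough illustration of the default-on, opt-out behavior described above, here is a minimal Python sketch of how a build wrapper script might forward the setting to CMake. The `--no-offload-compress` switch, the `build_cmake_args` helper, and the `BUILD_OFFLOAD_COMPRESS` cache variable are hypothetical names chosen for this example; they are not taken from this PR or from rmake.py.

```python
import argparse


def build_cmake_args(argv=None):
    """Return a CMake configure command line for a hypothetical build wrapper."""
    parser = argparse.ArgumentParser(description="Hypothetical build wrapper")
    # Hypothetical switch: compression stays on unless the user opts out.
    parser.add_argument(
        "--no-offload-compress",
        dest="offload_compress",
        action="store_false",  # store_false makes the default value True
        help="do not pass --offload-compress when building device code",
    )
    args = parser.parse_args(argv)

    # BUILD_OFFLOAD_COMPRESS is an assumed variable name, used for illustration only.
    toggle = "ON" if args.offload_compress else "OFF"
    return ["cmake", f"-DBUILD_OFFLOAD_COMPRESS={toggle}", ".."]
```

Under these assumptions, calling `build_cmake_args([])` yields a configure line with the variable set to `ON`, and calling `build_cmake_args(["--no-offload-compress"])` sets it to `OFF`. A user invoking CMake directly would set the equivalent `-D` cache variable by hand.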

Test Plan

Running CI tests

Test Result

Submission Checklist

Comment thread projects/rocthrust/rmake.py Outdated
Comment thread projects/rocthrust/rmake.py
@stanleytsang-amd stanleytsang-amd changed the title from "Users/stanleytsang amd/rocthrust offload compress" to "Enable offload compress for rocthrust" Nov 4, 2025
@stanleytsang-amd stanleytsang-amd merged commit 27bdd25 into develop Nov 7, 2025
23 checks passed
@stanleytsang-amd stanleytsang-amd deleted the users/stanleytsang-amd/rocthrust-offload-compress branch November 7, 2025 18:16
assistant-librarian Bot pushed a commit to ROCm/rocThrust that referenced this pull request Nov 7, 2025
Enable offload compress for rocthrust

## Motivation

When compiling for a sufficiently large number of GPU architectures, the
fat binary can become so large that linker errors occur. Enabling
--offload-compress reduces the binary size, which avoids this problem.

## Technical Details

Enables offload compression (--offload-compress) by default. It can
still be turned off manually through CMake.

## Test Plan

Running CI tests

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ammallya pushed a commit that referenced this pull request Feb 3, 2026
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: c8eb2f9]
ammallya pushed a commit that referenced this pull request Feb 3, 2026
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
The problem occurred because config.time_kernel is false by default.
When it is false, there is no need to calculate performance; the example just prints a proper message:

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26]