[rocPRIM] Reset internal hip error for tests that run out of memory#75

Merged
umfranzw merged 1 commit into ROCm:develop from umfranzw:fix_out_of_mem_tests
May 28, 2025

Conversation

@umfranzw
Contributor

The behaviour of hipGetLastError is changing in HIP 7.0. Previously, the stored error was cleared on every HIP API call, so hipGetLastError reported only an error that occurred during the most recent HIP API call.

Moving forward, the stored error will only be cleared by a call to hipGetLastError. This means that hipGetLastError will report any error that has occurred since the last call to hipGetLastError.

Some of our tests rely on observing a return value of hipErrorOutOfMemory from hipMalloc when an allocation is too large for a given GPU architecture's memory system. This sets the internal HIP error, and it's not cleared before subsequent tests call hipGetLastError, causing them to fail.

This change adds extra calls to hipGetLastError to clear the error (for future tests) in cases where tests run out of memory.
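
To make the failure mode concrete, here is a minimal sketch in plain C++ that models the two behaviours described above. It does not call HIP: `FakeRuntime`, `Status`, `Semantics`, and `second_test_passes` are hypothetical names invented for illustration, and the model only encodes the semantics as stated in this description, not the actual HIP implementation.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only: none of these names are part of the HIP API.
enum class Status { Success, OutOfMemory };

enum class Semantics {
    ClearedByEveryCall,  // behaviour described for HIP before 7.0
    ClearedOnlyWhenRead  // behaviour described for HIP 7.0 onward
};

class FakeRuntime {
public:
    explicit FakeRuntime(Semantics semantics) : semantics_(semantics) {}

    // Stands in for hipMalloc: fails when the request exceeds capacity.
    Status allocate(std::size_t bytes) {
        return record(bytes > capacity_ ? Status::OutOfMemory : Status::Success);
    }

    // Stands in for hipGetLastError: returns the stored error and resets it.
    Status get_last_error() {
        const Status stored = last_error_;
        last_error_ = Status::Success;
        return stored;
    }

private:
    Status record(Status result) {
        if (semantics_ == Semantics::ClearedByEveryCall) {
            last_error_ = result;  // every call overwrites the stored error
        } else if (result != Status::Success) {
            last_error_ = result;  // only failures overwrite; successes leave it alone
        }
        return result;
    }

    static constexpr std::size_t capacity_ = 1024;
    Semantics semantics_;
    Status last_error_ = Status::Success;
};

// Mimics two consecutive tests sharing one runtime. The first expects an
// out-of-memory result; the second makes a successful call and then checks
// the stored error. Returns true if the second test would pass.
bool second_test_passes(Semantics semantics, bool clear_after_oom) {
    FakeRuntime runtime(semantics);

    // "Test 1": an allocation that is too large is expected to fail.
    const Status first = runtime.allocate(4096);
    assert(first == Status::OutOfMemory);
    if (clear_after_oom) {
        (void)runtime.get_last_error();  // the extra call this change adds
    }

    // "Test 2": a small allocation succeeds, then the stored error is checked.
    (void)runtime.allocate(16);
    return runtime.get_last_error() == Status::Success;
}
```

In this model the second test passes under the old semantics with or without the extra call, fails under the new semantics when the error is left in place, and passes again once the out-of-memory path reads (and thereby clears) the stored error.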

@umfranzw umfranzw force-pushed the fix_out_of_mem_tests branch from 777b44a to 280f15c Compare May 26, 2025 19:30
@umfranzw umfranzw changed the title from "Reset internal hip error for tests that run out of memory" to "[rocPRIM] Reset internal hip error for tests that run out of memory" May 26, 2025
@umfranzw umfranzw force-pushed the fix_out_of_mem_tests branch from 280f15c to 869f2e3 Compare May 26, 2025 22:18
Contributor

@NguyenNhuDi NguyenNhuDi left a comment

lgtm

@danielsu-amd
Contributor

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@umfranzw umfranzw merged commit 267d983 into ROCm:develop May 28, 2025
30 of 32 checks passed
assistant-librarian Bot pushed a commit to ROCm/rocPRIM that referenced this pull request Jun 2, 2025
[rocPRIM] Reset internal hip error for tests that run out of
 memory (#75)

ammallya pushed a commit that referenced this pull request Sep 24, 2025
…75)

* Initial test setup and implementation of first instance of sending the graph to the backend library from the frontend.

* add comment for weak ptr

* implement almost everything up until execute. todo, proper validate, setting up variant pack

* implement happy path tests for the graph setup functions

* Add packing of variant pack and calling execute.

* fix most code review concerns

* fix the backend_execute_api test

* Add another test logging initializer that uses default spdlog functionality rather than the callback system.  Callback system seems to print all logs at the end of the test rather than as they happened.

* fix bad error logging

* add test for formatters

* add converter to go from frontend to backend heur mode
ammallya pushed a commit that referenced this pull request Sep 24, 2025
…75)

[ROCm/hipDNN commit: 494969d]
evetsso pushed a commit to evetsso/rocm-libraries that referenced this pull request Dec 31, 2025
* [gfx1250] Fix example issues for gfx125x

1. Refine KGroup in GridwiseMoeGemm; the dequant pipeline doesn't support KGroup for now
2. Enable the examples example_moe_gemm1_xdl_pk_i4, example_moe_gemm2_xdl_pk_i4, and example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16 for gfx125x

* [gfx1250] Work around the hipOccupancyMaxActiveBlocksPerMultiprocessor return value

hipOccupancyMaxActiveBlocksPerMultiprocessor returns 0 on gfx125x, which causes all streamk examples to crash

Workaround: clamp the returned value to a minimum of 1.

---------

Co-authored-by: Qun Lin <qlin@amd.com>

4 participants