Skip to content

MIOpen error tolerance#912

Open
assistant-librarian[bot] wants to merge 25 commits into
developfrom
import/develop/StreamHPC_MIOpen/miopen-error-tolerance
Open

MIOpen error tolerance#912
assistant-librarian[bot] wants to merge 25 commits into
developfrom
import/develop/StreamHPC_MIOpen/miopen-error-tolerance

Conversation

@assistant-librarian
Copy link
Copy Markdown
Contributor

Scope

While testing the variant 0 for bnorm backward spatial single, we observed that the error tolerance in MIOpenDriver is sub-optimal. For instance, the following results were observed

Notice that the execution is successful, according to the driver checks. However, the error reported is high so this should actually be failing.

Notes for the reviewer

  • Fixed the definition because it was too big compared to the rms values being computed (due to the rms normalization done). The method seems to also be using the same value for as the new implementation in .
  • Fixed the normalization of the rms. Previously we were finding the maximum absolute value from both the reference values and the results, but this may not be correct because it reduces the significance of the rms: if the results differ a lot from the reference values (e.g. ref values are order of 10 and results are order 1000) then the normalization will divide the rms obtained (which will be order of 100) by a number order of 1000 and then the normalized rms will be 100/1000 = 0.1 which is way lower than what it should be.
    • The new implementation only takes into account the reference results for computing the normalization factor

🔁 Imported from ROCm/MIOpen#3638
🧑‍💻 Originally authored by @Beanavil

@assistant-librarian assistant-librarian Bot requested a review from a team as a code owner July 28, 2025 20:28
jonatluu pushed a commit to jonatluu/rocm-libraries that referenced this pull request Jul 28, 2025
* mariner OS changing name to azurelinux

(cherry picked from commit 78f398e)

[ROCm/hipBLAS commit: 1e5eb09]
ammallya pushed a commit that referenced this pull request Sep 18, 2025
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.1 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.17.1...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rocSOLVER commit: ee9e185]
@sikba sikba mentioned this pull request Jan 12, 2026
@sikba sikba force-pushed the import/develop/StreamHPC_MIOpen/miopen-error-tolerance branch from 9af1e6a to c782460 Compare January 12, 2026 16:58
BrianHarrisonAMD pushed a commit that referenced this pull request Jan 13, 2026
## Motivation

During ivestigating MIOpen error tolerance #912 , CI failed for RoPE
kernels in bf16.

## Technical Details

This PR fixes the wrong casting and negating operation order for fwd and
bwd RoPE.

## Test Plan

All tests pass. For some driver calls, the error got smaller. After the
error tolerance fix applied, this kernel will pass the tests too.
@sikba sikba force-pushed the import/develop/StreamHPC_MIOpen/miopen-error-tolerance branch from c782460 to 578cb4a Compare January 13, 2026 15:22
@sikba sikba force-pushed the import/develop/StreamHPC_MIOpen/miopen-error-tolerance branch from 578cb4a to 37be11f Compare January 15, 2026 16:13
BradPepersAMD pushed a commit that referenced this pull request Feb 17, 2026
## Motivation

During ivestigating MIOpen error tolerance #912 , CI failed for the
above mentioned kernel.

## Technical Details

BF16 FLOAT maps to ushort storage type.
Accumulating directly in FLOAT causes arithmetic on BF16 bit patterns
and produces corrupted dx.

This PR replaces FLOAT accumulation with FLOAT_ACCUM.
Adds explicit convert-in/convert-out through CVT_FLOAT2ACCUM /
CVT_ACCUM2FLOAT.


## Test Result

The previously failing tests with the tightened error check are now also
passing.
SamuelReeder pushed a commit that referenced this pull request Feb 17, 2026
## Motivation

During ivestigating MIOpen error tolerance #912 , CI failed for the
above mentioned kernel.

## Technical Details

BF16 FLOAT maps to ushort storage type.
Accumulating directly in FLOAT causes arithmetic on BF16 bit patterns
and produces corrupted dx.

This PR replaces FLOAT accumulation with FLOAT_ACCUM.
Adds explicit convert-in/convert-out through CVT_FLOAT2ACCUM /
CVT_ACCUM2FLOAT.


## Test Result

The previously failing tests with the tightened error check are now also
passing.
kamuruga08 pushed a commit that referenced this pull request Feb 19, 2026
## Motivation

During ivestigating MIOpen error tolerance #912 , CI failed for the
above mentioned kernel.

## Technical Details

BF16 FLOAT maps to ushort storage type.
Accumulating directly in FLOAT causes arithmetic on BF16 bit patterns
and produces corrupted dx.

This PR replaces FLOAT accumulation with FLOAT_ACCUM.
Adds explicit convert-in/convert-out through CVT_FLOAT2ACCUM /
CVT_ACCUM2FLOAT.


## Test Result

The previously failing tests with the tightened error check are now also
passing.
kamuruga08 pushed a commit that referenced this pull request Feb 19, 2026
## Motivation

During ivestigating MIOpen error tolerance #912 , CI failed for the
above mentioned kernel.

## Technical Details

BF16 FLOAT maps to ushort storage type.
Accumulating directly in FLOAT causes arithmetic on BF16 bit patterns
and produces corrupted dx.

This PR replaces FLOAT accumulation with FLOAT_ACCUM.
Adds explicit convert-in/convert-out through CVT_FLOAT2ACCUM /
CVT_ACCUM2FLOAT.


## Test Result

The previously failing tests with the tightened error check are now also
passing.
mgates3 pushed a commit to mgates3/rocm-libraries that referenced this pull request Mar 6, 2026
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.1 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v1.17.1...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
## Motivation

During ivestigating MIOpen error tolerance ROCm#912 , CI failed for the
above mentioned kernel.

## Technical Details

BF16 FLOAT maps to ushort storage type.
Accumulating directly in FLOAT causes arithmetic on BF16 bit patterns
and produces corrupted dx.

This PR replaces FLOAT accumulation with FLOAT_ACCUM.
Adds explicit convert-in/convert-out through CVT_FLOAT2ACCUM /
CVT_ACCUM2FLOAT.


## Test Result

The previously failing tests with the tightened error check are now also
passing.
@brentmaas brentmaas requested review from EwanC and brentmaas April 17, 2026 12:04
Copy link
Copy Markdown
Contributor

@brentmaas brentmaas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but the PR description that got ported over from the original MIOpen PR is missing all the information in backticks. It would be worth restoring that information so that the git commit message retains those details when this is merged (assuming the PR description is used as part of the commit message).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants