Skip to content

resolve Nan issue when using tf - SWDEV-285232#911

Merged
atamazov merged 1 commit into
developfrom
fix_swd_285232
May 11, 2021
Merged

resolve Nan issue when using tf - SWDEV-285232#911
atamazov merged 1 commit into
developfrom
fix_swd_285232

Conversation

@shaojiewang
Copy link
Copy Markdown
Contributor

@shaojiewang shaojiewang commented May 7, 2021

Regarding https://ontrack-internal.amd.com/browse/SWDEV-285232. resolve Nan issue.
With this branch, nan issue is gone.

  1. remove aggressive padding method for weights' loading.
  2. add condition to check in solver.

… add more valid check condition for fp16 fwd
@atamazov
Copy link
Copy Markdown
Contributor

atamazov commented May 7, 2021

Shall we test it on CI right now?

@shaojiewang
Copy link
Copy Markdown
Contributor Author

Shall we test it on CI right now?

No, not yet.

@shaojiewang
Copy link
Copy Markdown
Contributor Author

Self test done, ready for review.

@shaojiewang shaojiewang marked this pull request as ready for review May 8, 2021 06:25
@shaojiewang shaojiewang changed the title resolve Nan issue when using tf resolve Nan issue when using tf - SWDEV-285232 May 8, 2021
Copy link
Copy Markdown
Contributor

@carlushuang carlushuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This fix all the fwd kernels by setting the proper range check in sgpr[2:3] of buffer load instruction

Copy link
Copy Markdown
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see anything bad))

@atamazov
Copy link
Copy Markdown
Contributor

atamazov commented May 8, 2021

@shaojiewang Can you please add regression tests for this issue?

@shaojiewang
Copy link
Copy Markdown
Contributor Author

shaojiewang commented May 10, 2021

@shaojiewang Can you please add regression tests for this issue?

The test is done with tensorflow as described in SWDEV-285232. And the problem is not observed when tested by MIOpenDriver. Do we pass the e2e test to QA?

@shaojiewang shaojiewang requested a review from junliume May 10, 2021 23:23
@junliume
Copy link
Copy Markdown
Contributor

@atamazov CI passed, large number of file while actual changes are small and have benign looks. As @shaojiewang commented "tested for about 1 day. No nan issue is observed." Ready to merge?

@atamazov
Copy link
Copy Markdown
Contributor

Yes. But we need to think about regression tests -- I mean the ones that ran on our CI and ensure that the same bug will not appear in the future.

@atamazov atamazov merged commit 9d1380b into develop May 11, 2021
@atamazov atamazov deleted the fix_swd_285232 branch June 23, 2021 13:19
assistant-librarian Bot added a commit that referenced this pull request Oct 24, 2025
[MIOpen] Address some leftover issues from the Batchnorm
 tuning API (#911)

The API version of MIOPEN_FIND_ENFORCE was merged via a separate PR, but
there were a few leftover review issues and documentation updates pending that are addressed
by these changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants