fix a potential issue in the pipeline pass by scxiao · Pull Request #217 · ROCm/triton

scxiao · 2023-05-16T22:07:06Z

The fix here is relate to the function call of op.getResult(0), in which an op could have zero results, then this function call will have an assert failed.

The fix is to check the number of results of an op before calling this function. Only if an op has 1 or more results, will the function op.getResult called.

zhanglx13 · 2023-05-17T01:14:42Z

@scxiao Can you provide more context about the problem?

On which GPU did you see this error? Is it only on MI300?
For which input did you see this error?
For the input that triggers the error, what is op look like? Why does it have zero result?

scxiao · 2023-05-17T15:30:17Z

@scxiao Can you provide more context about the problem?

On which GPU did you see this error? Is it only on MI300?
This is a general problem if we upstream the llvm project to the latest. Actually, yesterday, the openai/triton just upstreamed the llvm it uses and the changes they made include what I did in this PR. see https://github.com/openai/triton/blob/17eb982771e939406b6f801d80d51498507893dd/lib/Dialect/TritonGPU/Transforms/Pipeline.cpp#L391.

For which input did you see this error?
I observed this problem when I ran the test_gemm.py. Specifically, this case can trigger the problem in this PR:
pytest test_gemm.py::test_gemm[64-32-128-4-64-32-64]

For the input that triggers the error, what is op look like? Why does it have zero result?
This is related to the for loop used in the gemm implementation. The for loop is optimized in the pipeline pass. It is rewritten to a bunch of OPs with the scf.yield as the terminator. scf.yield has 0 result, but in this pass, it assumes it has a default one result, which causes an assert failed in the function call of op.getResult(0). The reason it did not fail before is the assert as added to getResult() recently, and then this problem is surfaced.

zhanglx13 · 2023-05-17T15:45:23Z

@scxiao So with the current llvm, getResult(0) does not fail even if there is no result to get for the op, right?
I just want to see a confirmation that upstream triton also has this problem and a fix for this problem. Then maybe we should wait for the PR#219 to take in the fix.
I also noticed that this PR has two more checks (op.getNumResults() > 0) compared to the upstream fix. Is there a reason for this difference?

scxiao · 2023-05-24T18:55:08Z

Changes in this PR is also included in in @jayfurmanek upstream PR #219, so close this PR.

…217) Implements the proposal in ROCm/triton-internal#1318. For Gluon instead of following the tokens, which are difficult to loop carry in gluon, we define the waitCnt as number of commit groups ops in Gluon. The UpdateAsyncWaitCnt pass will then walk the IR backwards and computes the minimum number of created async intrinsic by waitCnt outstanding commit groups. Current limitations are: We treat dynamic and static loops the same. This means we could get conservative loads if the prologue prefetch loop is not unrolled. For our Gluon kernels they were always unrolled. I can add support in a follow up PR since this one is already quite large. scf.ExecuteRegion is not handled to keep the PR minimal. But adding support for it is trivial when the new PingPong lands Warp specialization is not handled. We will probably use LDS/Named barriers so this might not be needed? Adding it should also be not a problem in case we need it. Note that for TDM each tensor_load/store in TGGIR will create exactly one intrinsic so we can use the count from Gluon directly.

fix a potential issue in the pipeline pass

7c8586e

scxiao requested review from jayfurmanek and zhanglx13 May 16, 2023 22:07

scxiao closed this May 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix a potential issue in the pipeline pass#217

fix a potential issue in the pipeline pass#217
scxiao wants to merge 1 commit intotriton-mlirfrom
pipeline-potential-issue

scxiao commented May 16, 2023

Uh oh!

zhanglx13 commented May 17, 2023

Uh oh!

scxiao commented May 17, 2023

Uh oh!

zhanglx13 commented May 17, 2023 •

edited

Loading

Uh oh!

scxiao commented May 24, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

scxiao commented May 16, 2023

Uh oh!

zhanglx13 commented May 17, 2023

Uh oh!

scxiao commented May 17, 2023

Uh oh!

zhanglx13 commented May 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scxiao commented May 24, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zhanglx13 commented May 17, 2023 •

edited

Loading