Skip to content

fix a potential issue in the pipeline pass#217

Closed
scxiao wants to merge 1 commit intotriton-mlirfrom
pipeline-potential-issue
Closed

fix a potential issue in the pipeline pass#217
scxiao wants to merge 1 commit intotriton-mlirfrom
pipeline-potential-issue

Conversation

@scxiao
Copy link
Copy Markdown

@scxiao scxiao commented May 16, 2023

The fix here is relate to the function call of op.getResult(0), in which an op could have zero results, then this function call will have an assert failed.

The fix is to check the number of results of an op before calling this function. Only if an op has 1 or more results, will the function op.getResult called.

@scxiao scxiao requested review from jayfurmanek and zhanglx13 May 16, 2023 22:07
@zhanglx13
Copy link
Copy Markdown

@scxiao Can you provide more context about the problem?

  1. On which GPU did you see this error? Is it only on MI300?
  2. For which input did you see this error?
  3. For the input that triggers the error, what is op look like? Why does it have zero result?

@scxiao
Copy link
Copy Markdown
Author

scxiao commented May 17, 2023

@scxiao Can you provide more context about the problem?

  1. On which GPU did you see this error? Is it only on MI300?
    This is a general problem if we upstream the llvm project to the latest. Actually, yesterday, the openai/triton just upstreamed the llvm it uses and the changes they made include what I did in this PR. see https://github.com/openai/triton/blob/17eb982771e939406b6f801d80d51498507893dd/lib/Dialect/TritonGPU/Transforms/Pipeline.cpp#L391.
  1. For which input did you see this error?
    I observed this problem when I ran the test_gemm.py. Specifically, this case can trigger the problem in this PR:
    pytest test_gemm.py::test_gemm[64-32-128-4-64-32-64]
  1. For the input that triggers the error, what is op look like? Why does it have zero result?
    This is related to the for loop used in the gemm implementation. The for loop is optimized in the pipeline pass. It is rewritten to a bunch of OPs with the scf.yield as the terminator. scf.yield has 0 result, but in this pass, it assumes it has a default one result, which causes an assert failed in the function call of op.getResult(0). The reason it did not fail before is the assert as added to getResult() recently, and then this problem is surfaced.

@zhanglx13
Copy link
Copy Markdown

zhanglx13 commented May 17, 2023

@scxiao So with the current llvm, getResult(0) does not fail even if there is no result to get for the op, right?
I just want to see a confirmation that upstream triton also has this problem and a fix for this problem. Then maybe we should wait for the PR#219 to take in the fix.
I also noticed that this PR has two more checks (op.getNumResults() > 0) compared to the upstream fix. Is there a reason for this difference?

@scxiao
Copy link
Copy Markdown
Author

scxiao commented May 24, 2023

Changes in this PR is also included in in @jayfurmanek upstream PR #219, so close this PR.

@scxiao scxiao closed this May 24, 2023
guacamoleo pushed a commit that referenced this pull request Dec 10, 2025
…217)

Implements the proposal in ROCm/triton-internal#1318. For Gluon instead of following the tokens, which are difficult to loop carry in gluon, we define the waitCnt as number of commit groups ops in Gluon. The UpdateAsyncWaitCnt pass will then walk the IR backwards and computes the minimum number of created async intrinsic by waitCnt outstanding commit groups.

Current limitations are:

We treat dynamic and static loops the same. This means we could get conservative loads if the prologue prefetch loop is not unrolled. For our Gluon kernels they were always unrolled. I can add support in a follow up PR since this one is already quite large.
scf.ExecuteRegion is not handled to keep the PR minimal. But adding support for it is trivial when the new PingPong lands
Warp specialization is not handled. We will probably use LDS/Named barriers so this might not be needed? Adding it should also be not a problem in case we need it.
Note that for TDM each tensor_load/store in TGGIR will create exactly one intrinsic so we can use the count from Gluon directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants