Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[flang][OpenMP] Implement more robust loop-nest detection logic #127

Closed

Conversation

ergawy
Copy link

@ergawy ergawy commented Jul 30, 2024

The previous loop-nest detection algorithm fell short, in some cases, to detect whether a pair of do concurrent loops are perfectly nested or not. This is a re-implementation using forward and backward slice extraction algorithms to compare the set of ops required to setup the inner loop bounds vs. the set of ops nested in the outer loop other thatn the nested loop itself.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please alter the test to not rely on debug statements for failure/success determination.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not ideal indeed. Do you have any suggestions on how to do this differently?

I thought about doing this as a unit test, but setting up the test and later reading the test I think would be more complicated than a lit test the clearly shows what the loops look like and what the outcome should be.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can a special flag to print loop info to llvm::outs(). But I am not sure this is worth it tbh.

Copy link

@Meinersbur Meinersbur Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't like relying on debug output either. It randomly interleaves with stdout and only works with assertions-builds. Additionally, it makes the test dependent on how often/in which order the isPerfectlyNested function is called internally, making it sensitive to even NFC patches. But I have seen this in LLVM and Clang enough to be consider a established pattern there, although not for MLIR/Flang.

If you do this, you still must:

  1. Do not redirect/pipe stdout and stderr at the same time. either guarentee that just one of the ones is used or 2> instead &>. I've seen tests fail under Windows because of this.
  2. Add REQUIRES: asserts or it will fail in release builds

In MLIR I indeed usually see an internal option enabling additional printing. or print debug counters E.g. -test-print-shape-mapping, or a pass that just prints the analysis result, e.g. -pass-pipeline=...test-print-dominance, or test the diagnostic from -Rpass output.

Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info @Meinersbur, really useful.

I used 2> and REQUIRES: asserts.

Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

All other do-concurrent-conversion tests verify the output. I wanted this one to test only one thing which is loop-nest detection. My reasoning is that this test isolates this particular part of the pass so that we debug issues in nest detection more easily.

@ergawy
Copy link
Author

ergawy commented Aug 1, 2024

Ping! Please 🙏 take a look when you have time.

Copy link

@Meinersbur Meinersbur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because:

  1. Any operation that does not have side-effects can just be sunk into the inner loop or hoisted outside the outer loop, no matter whether it use used to compute the inner loop bounds or not. If it is invariant, an optimization can hoist it out again. I think this is the relevant property: "able to move all the code away".
  2. Ops might be needed to compute the inner loop's but have side-effects, e.g. a function call accesses a global variable.

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

@@ -36,7 +36,8 @@ namespace fir {
#include "flang/Optimizer/Transforms/Passes.h.inc"
} // namespace fir

#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
#define DEBUG_TYPE "do-concurrent-conversion"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The canonical form is LLVM_DEBUG(dbgs() << "text"). I don't think introducing new patterns for a single file is a good idea. DBGS may easily conflict with something else, as was the case with DEBUG which was eventually renamed to LLVM_DEBUG. Getting only a specific debug type can be done via cmdline: -mllvm -debug-only do-concurrent-conversion

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This same pattern is used in a lot of places in MLIR (38 existing times). For example: Dialect/Transform/IR/TransformOps.cpp, Dialect/Linalg/Transforms/Vectorization.cpp, and Dialect/Linalg/Transforms/Transforms.cpp, ...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmh, OK. I don't think it is a good idea but apparently some MLIR folks do.

Copy link

@Meinersbur Meinersbur Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't like relying on debug output either. It randomly interleaves with stdout and only works with assertions-builds. Additionally, it makes the test dependent on how often/in which order the isPerfectlyNested function is called internally, making it sensitive to even NFC patches. But I have seen this in LLVM and Clang enough to be consider a established pattern there, although not for MLIR/Flang.

If you do this, you still must:

  1. Do not redirect/pipe stdout and stderr at the same time. either guarentee that just one of the ones is used or 2> instead &>. I've seen tests fail under Windows because of this.
  2. Add REQUIRES: asserts or it will fail in release builds

In MLIR I indeed usually see an internal option enabling additional printing. or print debug counters E.g. -test-print-shape-mapping, or a pass that just prints the analysis result, e.g. -pass-pipeline=...test-print-dominance, or test the diagnostic from -Rpass output.

Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

@@ -0,0 +1,77 @@
! Tests loop-nest detection algorithm for do-concurrent mapping.

! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-parallel=host \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do-concurrent-conversion is a MLIR-to-MLIR pass. Those tests usually only contains the input MLIR of the pass so we don't test more than necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this to show loops on the Fortran source level. Just makes it easy to correlate for which loop nests do we detect that they are perfectly nested.

If you don't think this is a good enough arugment to have the test on the fortran level, I will replace it with MLIR instead.

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp Outdated Show resolved Hide resolved
flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp Outdated Show resolved Hide resolved
flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp Outdated Show resolved Hide resolved
@ergawy ergawy force-pushed the more_robust_loop_nest_detection branch from 83776bf to 4c3d9f1 Compare August 2, 2024 05:05
@ergawy
Copy link
Author

ergawy commented Aug 2, 2024

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

This happens if cases like the following (see complete sample here):

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

If you look the IR, you will see:

    fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
      ...
      %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
      hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
      hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
      %56 = fir.convert %55 : (i32) -> index
      ...
      fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
        ...
      }
    }

The problem here are the hlfir.end_associate ops. Even though the "effectively 2" loops are perfectly nested, we have these hlfir.end_associate ops that are not part of the slice responsible for computing the upper bound (in this case) of the inner loop even though they are in-practice exist only for the purpose of that computation.

@ergawy
Copy link
Author

ergawy commented Aug 2, 2024

The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because: ...

These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!

If you take again the same sample mentioned the previous comment, shouldn't flang have emitted:

      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32

before the outermost loop (the i loop)?

@mjklemm
Copy link

mjklemm commented Aug 2, 2024

The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because: ...

These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!

How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.

@ergawy
Copy link
Author

ergawy commented Aug 2, 2024

How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.

flang is potentially emitting wrong IR atm (as I mentioned but for different reasons in my previous reply). Nothing is preventing bar from mutating global state (see Fortran code and corresponding HLFIR IR in this comment).

@ergawy
Copy link
Author

ergawy commented Aug 2, 2024

Just as a follow up, if you modify the loop I posted earlier to be:

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = bar(n,m)
  end do

You get:

/tmp/test.f90:15:5: error: Impure procedure 'bar' may not be referenced in DO CONCURRENT
      a(i) = bar(n,m)
      ^^^^^^^^^^^^^^^

So, side-effects are checked only for the body of the loop and not for the bounds calcualations (which are still emitted inside the loop body).

@ergawy
Copy link
Author

ergawy commented Aug 2, 2024

From the spec:

C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.

which is expected, but what I cannot put my hands on is what constitues "within a DO CONCURRENT"? Does it include the concurrent-header?

Copy link

@Meinersbur Meinersbur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

The problem here are the hlfir.end_associate ops. Even though the "effectively 2" loops are perfectly nested, we have these hlfir.end_associate ops that are not part of the slice responsible for computing the upper bound (in this case) of the inner loop even though they are in-practice exist only for the purpose of that computation.

Thanks for the explanation. I wouldn't consider the loops perfectly nested though. Calling bar can have arbitrary side-effects, like accessing and incrementing global variables.

For such cases OpenMP specifies that side-effects of upper/lower bound expressions are undefined when/how often they are evaluated, but this does not apply here, so we cannot optimize based on that. Even if, the produced HLFIR seems indistinguishable from

  do concurrent(i=1:n)
    ub = bar(n*m, n/m)
    do concurrent(j=1:ub)
      a(i) = n
    end do
  end do

which (obviously?) is not perfectly nested. I think the frontend should be changed here. the call to bar should be emitted before the outer loop, it cannot depend on i and needs to be evaluated only once. Potentially adding a special case for when the n is zero to not call bar at all, if that is an issue.

In the meantime I this it is OK to just not support function calls as lb/ub expressions.

If you take again the same sample mentioned the previous comment, shouldn't flang have emitted:

      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32

before the outermost loop (the i loop)?

Wrote all of the above before I continued reading the following discussion... 😒

@@ -36,7 +36,8 @@ namespace fir {
#include "flang/Optimizer/Transforms/Passes.h.inc"
} // namespace fir

#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
#define DEBUG_TYPE "do-concurrent-conversion"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmh, OK. I don't think it is a good idea but apparently some MLIR folks do.

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp Outdated Show resolved Hide resolved
@Meinersbur
Copy link

Meinersbur commented Aug 6, 2024

From the spec:

C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.

which is expected, but what I cannot put my hands on is what constitues "within a DO CONCURRENT"? Does it include the concurrent-header?

I think they mean only the body. The header expressions can be evaluated once before entering any concurrent execution, so should be fine. Unfortunately flang currently doesn't seem to do so atm. Note that is make the way flang emits the loop violate this:

  do concurrent(i=1:n)
    ub = bar(n*m, n/m)
    do concurrent(j=1:ub)
      a(i) = n
    end do
  end do

since the non-pure bar is in the outer loop's body.

@ergawy
Copy link
Author

ergawy commented Aug 16, 2024

Update: I had a call with Kiran and Harish, and they will be working on making sure we emit code that is more consistent with the spec.

@ergawy ergawy force-pushed the more_robust_loop_nest_detection branch 2 times, most recently from fbbf705 to 8df3bac Compare August 22, 2024 05:30
@ergawy
Copy link
Author

ergawy commented Aug 22, 2024

@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).

But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.

@ergawy ergawy force-pushed the more_robust_loop_nest_detection branch 2 times, most recently from 71ece67 to c41c26a Compare August 22, 2024 05:43
@ergawy ergawy force-pushed the more_robust_loop_nest_detection branch 2 times, most recently from a4e6df3 to ef20b85 Compare August 27, 2024 15:53
@ergawy ergawy force-pushed the more_robust_loop_nest_detection branch from ef20b85 to 2447f0a Compare August 31, 2024 13:32
@ergawy
Copy link
Author

ergawy commented Sep 5, 2024

@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).

But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.

Ping Ping! 🔔 Please take a look when you have time 🙏

The previous loop-nest detection algorithm fell short, in some cases, to
detect whether a pair of `do concurrent` loops are perfectly nested or
not. This is a re-implementation using forward and backward slice
extraction algorithms to compare the set of ops required to setup the
inner loop bounds vs. the set of ops nested in the outer loop other
thatn the nested loop itself.
@ergawy ergawy force-pushed the more_robust_loop_nest_detection branch from 2447f0a to cf6eb8f Compare October 16, 2024 10:09
Copy link

@Meinersbur Meinersbur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and sorry for the delay

@ergawy
Copy link
Author

ergawy commented Oct 31, 2024

Thanks all for the review and discussions on this PR. I just merged llvm#114020 which means that do concurrent nests are now more conformant to the spec. This is changes loop nest detection logic in the do concurrent mapping pass; actually it should make it much simpler. However, this means I probably have to abandon this PR. Time and effort was not wasted though, llvm#114020 wouldn't have been introduced without this discussion.

ergawy added a commit to ergawy/llvm-project that referenced this pull request Nov 1, 2024
With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are:

  * the operations needed to update `loop A`'s iteration variable and
  * `loop B` itself.

This PR simlifies the pass a bit using the above logic and replaces ROCm#127.
ergawy added a commit to ergawy/llvm-project that referenced this pull request Nov 1, 2024
With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are:

  * the operations needed to update `loop A`'s iteration variable and
  * `loop B` itself.

This PR simlifies the pass a bit using the above logic and replaces ROCm#127.
@ergawy
Copy link
Author

ergawy commented Nov 1, 2024

Will close this in favor of #199. @Meinersbur please take a look, I still cannot add you as a reviewer for some reason.

@ergawy ergawy closed this Nov 1, 2024
ergawy added a commit to ergawy/llvm-project that referenced this pull request Nov 1, 2024
With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are:

  * the operations needed to update `loop A`'s iteration variable and
  * `loop B` itself.

This PR simlifies the pass a bit using the above logic and replaces ROCm#127.
ergawy added a commit to ergawy/llvm-project that referenced this pull request Nov 6, 2024
With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are:

  * the operations needed to update `loop A`'s iteration variable and
  * `loop B` itself.

This PR simlifies the pass a bit using the above logic and replaces ROCm#127.
ergawy added a commit to ergawy/llvm-project that referenced this pull request Nov 8, 2024
With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are:

  * the operations needed to update `loop A`'s iteration variable and
  * `loop B` itself.

This PR simlifies the pass a bit using the above logic and replaces ROCm#127.
ergawy added a commit to ergawy/llvm-project that referenced this pull request Nov 21, 2024
With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are:

  * the operations needed to update `loop A`'s iteration variable and
  * `loop B` itself.

This PR simlifies the pass a bit using the above logic and replaces ROCm#127.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants