[flang][MLIR] Hoist `do concurrent` nest bounds/steps outside the nest #114020

ergawy · 2024-10-29T09:47:23Z

If you have the following multi-range do concurrent loop:

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

Currently, flang generates the following IR:

    fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
      ...
      %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
      hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
      hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
      %56 = fir.convert %55 : (i32) -> index
      ...
      fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
        ...
      }
    }

However, if bar is impure, then we have a direct violation of the standard:

C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.

Moreover, the standard describes the execution of do concurrent construct in multiple stages:

11.1.7.4 Execution of a DO construct
...
11.1.7.4.2 DO CONCURRENT loop control
The concurrent-limit and concurrent-step expressions in the concurrent-control-list are evaluated. ...

11.1.7.4.3 The execution cycle
...
The block of a DO CONCURRENT construct is executed for every active combination of the index-name values.
Each execution of the block is an iteration. The executions may occur in any order.

From the above 2 points, it seems to me that execution is divided in multiple consecutive stages: 11.1.7.4.2 is the stage where we evaluate all control expressions including the step and then 11.1.7.4.3 is the stage to execute the block of the concurrent loop itself using the combination of possible iteration values.

llvmbot · 2024-10-29T09:48:36Z

@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Changes

If you have the following multi-range do concurrent loop:

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

Currently, flang generates the following IR:

    fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
      ...
      %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -&gt; (!fir.ref&lt;i32&gt;, !fir.ref&lt;i32&gt;, i1)
      %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -&gt; (!fir.ref&lt;i32&gt;, !fir.ref&lt;i32&gt;, i1)
      %55 = fir.call @<!-- -->_QFPbar(%53#<!-- -->1, %54#<!-- -->1) fastmath&lt;contract&gt; : (!fir.ref&lt;i32&gt;, !fir.ref&lt;i32&gt;) -&gt; i32
      hlfir.end_associate %53#<!-- -->1, %53#<!-- -->2 : !fir.ref&lt;i32&gt;, i1
      hlfir.end_associate %54#<!-- -->1, %54#<!-- -->2 : !fir.ref&lt;i32&gt;, i1
      %56 = fir.convert %55 : (i32) -&gt; index
      ...
      fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
        ...
      }
    }

However, if bar is impure, then we have a direct violation of the standard:

C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.

Moreover, the standard describes the execution of do concurrent construct in multiple stages:

11.1.7.4 Execution of a DO construct
...
11.1.7.4.2 DO CONCURRENT loop control
The concurrent-limit and concurrent-step expressions in the concurrent-control-list are evaluated. ...

11.1.7.4.3 The execution cycle
...
The block of a DO CONCURRENT construct is executed for every active combination of the index-name values.
Each execution of the block is an iteration. The executions may occur in any order.

From the above 2 points, it seems to me that execution is divided in multiple consecutive stages: 11.1.7.4.2 is the stage where we evaluate all control expressions including the step and then 11.1.7.4.3 is the stage to execute the block of the concurrent loop itself using the combination of possible iteration values.

Full diff: https://github.com/llvm/llvm-project/pull/114020.diff

2 Files Affected:

(modified) flang/lib/Lower/Bridge.cpp (+26-11)
(added) flang/test/Lower/do_concurrent.f90 (+59)

diff --git a/flang/lib/Lower/Bridge.cpp b/flang/lib/Lower/Bridge.cpp
index 877fe122265dd0..904dd82ed8149f 100644
--- a/flang/lib/Lower/Bridge.cpp
+++ b/flang/lib/Lower/Bridge.cpp
@@ -2131,18 +2131,33 @@ class FirConverter : public Fortran::lower::AbstractConverter {
       llvm::SmallVectorImpl<const Fortran::parser::CompilerDirective *> &dirs) {
     assert(!incrementLoopNestInfo.empty() && "empty loop nest");
     mlir::Location loc = toLocation();
+
     for (IncrementLoopInfo &info : incrementLoopNestInfo) {
-      info.loopVariable =
-          genLoopVariableAddress(loc, *info.loopVariableSym, info.isUnordered);
-      mlir::Value lowerValue = genControlValue(info.lowerExpr, info);
-      mlir::Value upperValue = genControlValue(info.upperExpr, info);
-      bool isConst = true;
-      mlir::Value stepValue = genControlValue(
-          info.stepExpr, info, info.isStructured() ? nullptr : &isConst);
-      // Use a temp variable for unstructured loops with non-const step.
-      if (!isConst) {
-        info.stepVariable = builder->createTemporary(loc, stepValue.getType());
-        builder->create<fir::StoreOp>(loc, stepValue, info.stepVariable);
+      mlir::Value lowerValue;
+      mlir::Value upperValue;
+      mlir::Value stepValue;
+
+      {
+        mlir::OpBuilder::InsertionGuard guard(*builder);
+
+        // Set the IP before the first loop in the nest so that all nest bounds
+        // and step values are created outside the nest.
+        if (incrementLoopNestInfo[0].doLoop)
+          builder->setInsertionPoint(incrementLoopNestInfo[0].doLoop);
+
+        info.loopVariable = genLoopVariableAddress(loc, *info.loopVariableSym,
+                                                   info.isUnordered);
+        lowerValue = genControlValue(info.lowerExpr, info);
+        upperValue = genControlValue(info.upperExpr, info);
+        bool isConst = true;
+        stepValue = genControlValue(info.stepExpr, info,
+                                    info.isStructured() ? nullptr : &isConst);
+        // Use a temp variable for unstructured loops with non-const step.
+        if (!isConst) {
+          info.stepVariable =
+              builder->createTemporary(loc, stepValue.getType());
+          builder->create<fir::StoreOp>(loc, stepValue, info.stepVariable);
+        }
       }
 
       // Structured loop - generate fir.do_loop.
diff --git a/flang/test/Lower/do_concurrent.f90 b/flang/test/Lower/do_concurrent.f90
new file mode 100644
index 00000000000000..cc6bbf69d21b71
--- /dev/null
+++ b/flang/test/Lower/do_concurrent.f90
@@ -0,0 +1,59 @@
+! RUN: %flang_fc1 -emit-hlfir -o - %s | FileCheck %s
+
+! Simple tests for structured concurrent loops with loop-control.
+
+pure function bar(n, m)
+   implicit none
+   integer, intent(in) :: n, m
+   integer :: bar
+   bar = n + m
+end function
+
+subroutine sub1(n)
+   implicit none
+   integer :: n, m, i, j, k
+   integer, dimension(n) :: a
+!CHECK: %[[LB1:.*]] = arith.constant 1 : i32
+!CHECK: %[[LB1_CVT:.*]] = fir.convert %[[LB1]] : (i32) -> index
+!CHECK: %[[UB1:.*]] = fir.load %{{.*}}#0 : !fir.ref<i32>
+!CHECK: %[[UB1_CVT:.*]] = fir.convert %[[UB1]] : (i32) -> index
+
+!CHECK: %[[LB2:.*]] = arith.constant 1 : i32
+!CHECK: %[[LB2_CVT:.*]] = fir.convert %[[LB2]] : (i32) -> index
+!CHECK: %[[UB2:.*]] = fir.call @_QPbar(%{{.*}}, %{{.*}}) proc_attrs<pure> fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
+!CHECK: %[[UB2_CVT:.*]] = fir.convert %[[UB2]] : (i32) -> index
+
+!CHECK: %[[LB3:.*]] = arith.constant 5 : i32
+!CHECK: %[[LB3_CVT:.*]] = fir.convert %[[LB3]] : (i32) -> index
+!CHECK: %[[UB3:.*]] = arith.constant 10 : i32
+!CHECK: %[[UB3_CVT:.*]] = fir.convert %[[UB3]] : (i32) -> index
+
+!CHECK: fir.do_loop %{{.*}} = %[[LB1_CVT]] to %[[UB1_CVT]] step %{{.*}} unordered
+!CHECK: fir.do_loop %{{.*}} = %[[LB2_CVT]] to %[[UB2_CVT]] step %{{.*}} unordered
+!CHECK: fir.do_loop %{{.*}} = %[[LB3_CVT]] to %[[UB3_CVT]] step %{{.*}} unordered
+
+   do concurrent(i=1:n, j=1:bar(n*m, n/m), k=5:10)
+      a(i) = n
+   end do
+end subroutine
+
+subroutine sub2(n)
+   implicit none
+   integer :: n, m, i, j
+   integer, dimension(n) :: a
+!CHECK: %[[LB1:.*]] = arith.constant 1 : i32
+!CHECK: %[[LB1_CVT:.*]] = fir.convert %[[LB1]] : (i32) -> index
+!CHECK: %[[UB1:.*]] = fir.load %5#0 : !fir.ref<i32>
+!CHECK: %[[UB1_CVT:.*]] = fir.convert %[[UB1]] : (i32) -> index
+!CHECK: fir.do_loop %{{.*}} = %[[LB1_CVT]] to %[[UB1_CVT]] step %{{.*}} unordered
+!CHECK: %[[LB2:.*]] = arith.constant 1 : i32
+!CHECK: %[[LB2_CVT:.*]] = fir.convert %[[LB2]] : (i32) -> index
+!CHECK: %[[UB2:.*]] = fir.call @_QPbar(%{{.*}}, %{{.*}}) proc_attrs<pure> fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
+!CHECK: %[[UB2_CVT:.*]] = fir.convert %[[UB2]] : (i32) -> index
+!CHECK: fir.do_loop %{{.*}} = %[[LB2_CVT]] to %[[UB2_CVT]] step %{{.*}} unordered
+   do concurrent(i=1:n)
+      do concurrent(j=1:bar(n*m, n/m))
+         a(i) = n
+      end do
+   end do
+end subroutine

ergawy · 2024-10-29T09:49:15Z

This replaces #111665.

jeanPerier

Thank you for this update, looks good to me.

flang/lib/Lower/Bridge.cpp

NimishMishra

Thanks for this. LG

vdonaldson

Thanks for the generalization upate

flang/test/Lower/do_concurrent.f90

If you have the following multi-range `do concurrent` loop: ```fortran do concurrent(i=1:n, j=1:bar(n*m, n/m)) a(i) = n end do ``` Currently, flang generates the following IR: ```mlir fir.do_loop %arg1 = %42 to %44 step %c1 unordered { ... %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1) %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1) %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32 hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1 hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1 %56 = fir.convert %55 : (i32) -> index ... fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered { ... } } ``` However, if `bar` is impure, then we have a direct violation of the standard: ``` C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct. ``` Moreover, the standard describes the execution of `do concurrent` construct in multiple stages: ``` 11.1.7.4 Execution of a DO construct ... 11.1.7.4.2 DO CONCURRENT loop control The concurrent-limit and concurrent-step expressions in the concurrent-control-list are evaluated. ... 11.1.7.4.3 The execution cycle ... The block of a DO CONCURRENT construct is executed for every active combination of the index-name values. Each execution of the block is an iteration. The executions may occur in any order. ``` From the above 2 points, it seems to me that execution is divided in multiple consecutive stages: 11.1.7.4.2 is the stage where we evaluate all control expressions including the step and then 11.1.7.4.3 is the stage to execute the block of the concurrent loop itself using the combination of possible iteration values.

#198) If you have the following multi-range `do concurrent` loop: ```fortran do concurrent(i=1:n, j=1:bar(n*m, n/m)) a(i) = n end do ``` Currently, flang generates the following IR: ```mlir fir.do_loop %arg1 = %42 to %44 step %c1 unordered { ... %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1) %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1) %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32 hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1 hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1 %56 = fir.convert %55 : (i32) -> index ... fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered { ... } } ``` However, if `bar` is impure, then we have a direct violation of the standard: ``` C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct. ``` Moreover, the standard describes the execution of `do concurrent` construct in multiple stages: ``` 11.1.7.4 Execution of a DO construct ... 11.1.7.4.2 DO CONCURRENT loop control The concurrent-limit and concurrent-step expressions in the concurrent-control-list are evaluated. ... 11.1.7.4.3 The execution cycle ... The block of a DO CONCURRENT construct is executed for every active combination of the index-name values. Each execution of the block is an iteration. The executions may occur in any order. ``` From the above 2 points, it seems to me that execution is divided in multiple consecutive stages: 11.1.7.4.2 is the stage where we evaluate all control expressions including the step and then 11.1.7.4.3 is the stage to execute the block of the concurrent loop itself using the combination of possible iteration values.

With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are: * the operations needed to update `loop A`'s iteration variable and * `loop B` itself. This PR simlifies the pass a bit using the above logic and replaces ROCm#127.

llvm#114020) If you have the following multi-range `do concurrent` loop: ```fortran do concurrent(i=1:n, j=1:bar(n*m, n/m)) a(i) = n end do ``` Currently, flang generates the following IR: ```mlir fir.do_loop %arg1 = %42 to %44 step %c1 unordered { ... %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1) %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1) %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32 hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1 hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1 %56 = fir.convert %55 : (i32) -> index ... fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered { ... } } ``` However, if `bar` is impure, then we have a direct violation of the standard: ``` C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct. ``` Moreover, the standard describes the execution of `do concurrent` construct in multiple stages: ``` 11.1.7.4 Execution of a DO construct ... 11.1.7.4.2 DO CONCURRENT loop control The concurrent-limit and concurrent-step expressions in the concurrent-control-list are evaluated. ... 11.1.7.4.3 The execution cycle ... The block of a DO CONCURRENT construct is executed for every active combination of the index-name values. Each execution of the block is an iteration. The executions may occur in any order. ``` From the above 2 points, it seems to me that execution is divided in multiple consecutive stages: 11.1.7.4.2 is the stage where we evaluate all control expressions including the step and then 11.1.7.4.3 is the stage to execute the block of the concurrent loop itself using the combination of possible iteration values.

With llvm#114020, do-concurrent loop-nests are more conforment to the spec and easier to detect. All we need to do is to check that the only operations inside `loop A` which perfectly wraps `loop B` are: * the operations needed to update `loop A`'s iteration variable and * `loop B` itself. This PR simlifies the pass a bit using the above logic and replaces ROCm#127.

llvmbot added flang Flang issues not falling into any other category flang:fir-hlfir labels Oct 29, 2024

ergawy requested review from clementval, mjklemm, jeanPerier, vdonaldson, NimishMishra, kiranktp and Thirumalai-Shaktivel October 29, 2024 09:49

ergawy mentioned this pull request Oct 29, 2024

[Flang] Hoist concurrent-limit and concurrent-step expressions outside the outer most do concurrent loop #111665

Open

ergawy changed the title ~~[flang][MLIR] Hoist do concurrent nest bounds/step outside the nest~~ [flang][MLIR] Hoist do concurrent nest bounds/steps outside the nest Oct 29, 2024

jeanPerier approved these changes Oct 29, 2024

View reviewed changes

vdonaldson reviewed Oct 29, 2024

View reviewed changes

flang/lib/Lower/Bridge.cpp Outdated Show resolved Hide resolved

NimishMishra approved these changes Oct 29, 2024

View reviewed changes

ergawy force-pushed the hoist_do_concurrent_bounds branch 2 times, most recently from fec2260 to 67c28e4 Compare October 30, 2024 07:48

vdonaldson approved these changes Oct 30, 2024

View reviewed changes

flang/test/Lower/do_concurrent.f90 Outdated Show resolved Hide resolved

ergawy force-pushed the hoist_do_concurrent_bounds branch from 67c28e4 to 713e076 Compare October 31, 2024 04:41

ergawy force-pushed the hoist_do_concurrent_bounds branch from 713e076 to 3c1a4dc Compare October 31, 2024 06:33

ergawy merged commit 0698482 into llvm:main Oct 31, 2024
8 checks passed

ergawy mentioned this pull request Oct 31, 2024

Hoist do concurrent nest bounds/steps outside the nest (#114020) ROCm/llvm-project#198

Merged

ergawy mentioned this pull request Oct 31, 2024

[flang][OpenMP] Implement more robust loop-nest detection logic ROCm/llvm-project#127

Closed

ergawy mentioned this pull request Nov 1, 2024

[flang][OpenMP][DoConcurrent] Simplify loop-nest detection logic ROCm/llvm-project#199

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[flang][MLIR] Hoist `do concurrent` nest bounds/steps outside the nest #114020

[flang][MLIR] Hoist `do concurrent` nest bounds/steps outside the nest #114020

ergawy commented Oct 29, 2024

llvmbot commented Oct 29, 2024

ergawy commented Oct 29, 2024

jeanPerier left a comment

NimishMishra left a comment

vdonaldson left a comment

[flang][MLIR] Hoist do concurrent nest bounds/steps outside the nest #114020

[flang][MLIR] Hoist do concurrent nest bounds/steps outside the nest #114020

Conversation

ergawy commented Oct 29, 2024

llvmbot commented Oct 29, 2024

ergawy commented Oct 29, 2024

jeanPerier left a comment

Choose a reason for hiding this comment

NimishMishra left a comment

Choose a reason for hiding this comment

vdonaldson left a comment

Choose a reason for hiding this comment

[flang][MLIR] Hoist `do concurrent` nest bounds/steps outside the nest #114020

[flang][MLIR] Hoist `do concurrent` nest bounds/steps outside the nest #114020