@skatrak skatrak commented Oct 28, 2025

This set of patches removes the early tagging of Generic-SPMD target regions from MLIR to instead only tell apart Generic from SPMD. This matches the behavior of Clang, which then relies on the OpenMPOpt pass to detect situations where Generic kernels can be executed in SPMD mode, potentially after certain transformations.

Merging this PR results in split distribute + parallel do kernels running in Generic mode, which might cause performance regressions in those cases. This is because the OpenMPOpt pass is not yet prepared to properly SPMD-ize Generic kernels containing the new DeviceRTL loop functions, which only Flang currently generates.

Before these changes, Generic mode was broken whenever a parallel region was reached; with this PR, such regions should execute properly.
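
To illustrate, this is roughly the kernel shape affected, written as a hand-written C analogue (the PR itself targets Flang, where the equivalent would use `!$omp target teams distribute` and `!$omp parallel do`): a split `distribute` + `parallel do` nest that, after this change, starts out as a Generic kernel and is only SPMD-ized if OpenMPOpt can prove that safe.

```c
// C analogue of the affected Fortran pattern: "target teams distribute" over
// the outer loop, with a separate "parallel for" over the inner loop. With
// this PR the kernel is emitted in Generic mode; OpenMPOpt may later switch
// it to SPMD if it can prove that is safe.
#include <stdio.h>
#define N 256

int main(void) {
  static double a[N][N];
  #pragma omp target teams distribute map(tofrom: a)
  for (int i = 0; i < N; ++i) {
    #pragma omp parallel for
    for (int j = 0; j < N; ++j)
      a[i][j] = (double)(i + j);
  }
  printf("%f\n", a[N - 1][N - 1]);
  return 0;
}
```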

This patch removes the logic in MLIR that attempted to identify Generic kernels that could be executed in SPMD mode.

For Clang, this optimization is performed by the OpenMPOpt pass; the check was only needed here to work around OpenMPOpt's missing support for the new DeviceRTL APIs used by MLIR to LLVM IR translation that Clang doesn't currently emit (e.g. `__kmpc_distribute_static_loop`). Removing these checks from MLIR avoids duplicating logic that should be centralized in the OpenMPOpt pass.

Additionally, offloading kernels currently compiled through the OpenMP dialect fail to run parallel regions properly when in Generic mode. Disabling early detection makes this issue apparent for a range of kernels where it was previously masked by running them in SPMD mode.

Update TargetRegionFlags to mirror OMPTgtExecModeFlags

This patch adds the `__kmpc_alloc_shared` and `__kmpc_free_shared` DeviceRTL functions to the list of those the OMPIRBuilder is able to create.

This patch updates the allocation of some reduction and private variables within target regions to use device shared memory rather than private memory. This is a prerequisite for producing working Generic kernels containing parallel regions.

In particular, the following situations result in the use of device shared memory (only when compiling for the target device, and only if they appear inside of a target region representing a Generic kernel):
- Reduction variables on `teams` constructs.
- Private variables on `teams` and `distribute` constructs that are reduced or
used inside of a `parallel` region.

Currently, there is no support for delayed privatization on `teams` constructs,
so private variables on these constructs won't currently be affected. When
support is added, if it uses the existing `allocatePrivateVars` and
`cleanupPrivateVars` functions, usage of device shared memory will be
introduced automatically.
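
As a rough illustration of the situations listed above, the intended device-side effect is sketched below in hand-written C. This is not the actual generated IR, and the `__kmpc_alloc_shared`/`__kmpc_free_shared` prototypes are written from memory; the DeviceRTL headers are authoritative. It compiles as a translation unit, but linking would require the DeviceRTL.

```c
// Hand-written sketch of the intended codegen, not the real IR. A variable
// that lives at teams/distribute level but is read or reduced inside a
// parallel region moves from a private stack slot to device shared memory,
// so worker threads can see it in Generic mode.
#include <stddef.h>

// DeviceRTL entry points (declarations assumed; see the DeviceRTL headers
// for the authoritative prototypes).
extern void *__kmpc_alloc_shared(size_t size);
extern void __kmpc_free_shared(void *ptr, size_t size);

void generic_kernel_teams_body(void) {
  // Before: "double red;" on the main thread's stack, which other threads
  // of the team cannot access once a parallel region starts.
  double *red = __kmpc_alloc_shared(sizeof(double));
  *red = 0.0;

  /* ... parallel region runs here; every thread can load/store *red ... */

  __kmpc_free_shared(red, sizeof(double));
}
```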
Argument structures are created when sections of the LLVM IR corresponding to
an OpenMP construct are outlined into their own function. For this, stack
allocations are used.

This patch modifies this behavior when compiling for a target device and outlining `parallel`-related IR, so that the argument structure uses device shared memory instead of private stack space. This is needed so that all threads have access to these arguments.
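
Conceptually, the change looks like the following hand-written C sketch; names such as `captured_args` are illustrative, not the identifiers the OMPIRBuilder emits, and the runtime prototypes are assumed.

```c
// Simplified sketch of why the outlining argument structure must live in
// device shared memory: the team's main thread fills it in before the
// parallel region, and every worker thread reads it afterwards.
#include <stddef.h>

extern void *__kmpc_alloc_shared(size_t size);   // declarations assumed
extern void __kmpc_free_shared(void *ptr, size_t size);

struct captured_args {   // one field per value captured by the region
  double *a;
  int n;
};

static void parallel_region_outlined(struct captured_args *args) {
  for (int i = 0; i < args->n; ++i)
    args->a[i] = 0.0;     // stand-in for the real region body
}

void emit_parallel_call(double *a, int n) {
  // A stack alloca here would be private to the main thread; shared memory
  // makes the structure visible to all threads launched for the region.
  struct captured_args *args = __kmpc_alloc_shared(sizeof(*args));
  args->a = a;
  args->n = n;
  /* ... __kmpc_parallel_51(...) would be called here with args ... */
  parallel_region_outlined(args);
  __kmpc_free_shared(args, sizeof(*args));
}
```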

Address intermittent ICE triggered from the `OpenMPIRBuilder::finalize` method due to an invalid builder insertion point

Replace CodeExtractor callbacks with subclasses and simplify their creation based on OutlineInfo structures

This patch introduces codegen logic to produce a wrapper function argument for the `__kmpc_parallel_51` DeviceRTL function, needed to handle arguments passed using device shared memory in Generic mode.
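
The sketch below shows, in simplified hand-written C, the role this wrapper plays. The prototypes and the wrapper parameter meanings are assumptions modelled on the existing Clang GPU codegen (e.g. the `ident` parameter is reduced to `void *`); see DeviceRTL for the real declarations.

```c
// Worker threads do not receive the argument pointer directly; they call the
// wrapper, which fetches the shared argument block from the runtime and
// forwards it to the outlined parallel body.
#include <stdint.h>

extern void __kmpc_get_shared_variables(void ***args);            // assumed
extern void __kmpc_parallel_51(void *ident, int32_t gtid, int32_t if_expr,
                               int32_t num_threads, int32_t proc_bind,
                               void *outlined_fn, void *wrapper_fn,
                               void **args, int64_t nargs);       // simplified

// Outlined body of the parallel region (stand-in for the real code).
static void parallel_body_outlined(int32_t *gtid, int32_t *btid, double *x) {
  (void)gtid;
  (void)btid;
  *x += 1.0;
}

// Wrapper passed as the wrapper_fn argument of __kmpc_parallel_51.
void parallel_body_wrapper(int16_t unused, int32_t thread_id) {
  (void)unused;
  void **shared_args = 0;
  __kmpc_get_shared_variables(&shared_args);
  int32_t zero = 0;
  parallel_body_outlined(&thread_id, &zero, (double *)shared_args[0]);
}
```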
…unctions

This patch updates the OpenMP optimization pass to know about the new DeviceRTL
functions for loop constructs.

This change marks these functions as potentially containing parallel regions,
which fixes a current bug with the state machine rewrite optimization. It
previously failed to identify parallel regions located inside of the callbacks
passed to these new DeviceRTL functions, causing the resulting code to skip
executing these parallel regions.

As a result, Generic kernels produced by Flang that contain parallel regions
now work properly.
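
The pattern OpenMPOpt has to understand is sketched below in C, using a hypothetical, deliberately simplified stand-in for the `__kmpc_distribute_static_loop_*` family (the real DeviceRTL prototypes take additional parameters).

```c
// With the new DeviceRTL loop entry points, the loop body is passed as a
// function pointer, so a nested "parallel" region is only reachable through
// that indirect call.
#include <stdint.h>

// Hypothetical stand-in for __kmpc_distribute_static_loop_*; not the real
// prototype.
extern void distribute_static_loop_sketch(
    void (*loop_body)(int32_t iv, void *arg), void *arg, int32_t trip_count);

static void loop_body_callback(int32_t i, void *arg) {
  (void)i;
  (void)arg;
  /* ... a "parallel" region may be emitted here, inside the callback ... */
}

void lower_distribute(void *arg, int32_t n) {
  // The state-machine rewrite in OpenMPOpt previously did not look through
  // this call, so a parallel region inside the callback was never executed.
  distribute_static_loop_sketch(loop_body_callback, arg, n);
}
```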

One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this happened because these calls were assumed to contain no parallel region; that assumption is removed by this patch, but the functions are now temporarily marked as unsupported in an SPMD context instead. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.

In this patch, some OMPIRBuilder codegen functions and callbacks are updated to work with arrays of deallocation insertion points. The purpose is to enable replacing `alloca`s with other kinds of allocations that require explicit deallocation, in a way that also lets `CodeExtractor` instances created during OMPIRBuilder finalization use them.

The OpenMP to LLVM IR MLIR translation pass is updated to properly store and
forward deallocation points together with their matching allocation point to
the OMPIRBuilder.

Currently, only the `DeviceSharedMemCodeExtractor` uses this feature, to make the `CodeExtractor` use device shared memory for intermediate allocations when outlining a parallel region inside of a Generic kernel (a code path currently only used by Flang via MLIR). However, in the long term this might also be useful for refactoring the finalization of variables with destructors, potentially reducing the use of callbacks and simplifying privatization and reductions.

Instead of a single deallocation point, lists of them are used. This covers cases where multiple exit blocks originate from a single entry block: if an allocation needing explicit deallocation is placed in the entry block, it must be deallocated before each of the exits.
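
A minimal C sketch of why a single deallocation point is not enough (runtime prototypes assumed):

```c
// An allocation made in the entry block that requires an explicit
// deallocation must be released on every path out of the region.
#include <stddef.h>

extern void *__kmpc_alloc_shared(size_t size);   // declarations assumed
extern void __kmpc_free_shared(void *ptr, size_t size);

int region_with_two_exits(int cond) {
  double *tmp = __kmpc_alloc_shared(sizeof(double));  // allocation point
  *tmp = 1.0;
  if (cond) {
    __kmpc_free_shared(tmp, sizeof(double));  // deallocation point #1
    return 1;                                 // first exit block
  }
  *tmp *= 2.0;
  __kmpc_free_shared(tmp, sizeof(double));    // deallocation point #2
  return 0;                                   // second exit block
}
```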
This patch moves tablegen definitions that could be used for all kinds of heap
allocations out of `omp.target_allocmem` and into a new
`OpenMP_HeapAllocClause` that can be reused.

Descriptions are updated to follow the format of most other operations and the
custom verifier for `omp.target_allocmem` is removed as it only made a
redundant check on its result type.

This patch introduces the `omp.alloc_shared_mem` and `omp.free_shared_mem`
operations to represent explicit allocations and deallocations of shared memory
across threads in a team, mirroring the existing `omp.target_allocmem` and
`omp.target_freemem`.

The `omp.alloc_shared_mem` op goes through the same Flang-specific
transformations as `omp.target_allocmem`, so that the size of the buffer can be
properly calculated when translating to LLVM IR.

The corresponding runtime functions produced for these new operations are
`__kmpc_alloc_shared` and `__kmpc_free_shared`, which previously could only be
created for implicit allocations (e.g. privatized and reduction variables).

This patch introduces a new Flang OpenMP MLIR pass, run only for target device
modules, that identifies `fir.alloca` operations that should use device shared
memory and replaces them with pairs of `omp.alloc_shared_mem` and
`omp.free_shared_mem` operations.

This works in conjunction with the MLIR to LLVM IR translation pass's handling of
privatization, mapping and reductions in the OpenMP dialect to properly select
the right memory space for allocations based on where they are made and where
they are used.

This pass, in particular, handles explicit stack allocations in MLIR, whereas
the aforementioned translation pass takes care of implicit ones represented by
entry block arguments.

This patch refines the checks that decide whether to use device shared memory or regular stack allocations. In particular, it adds support for parallel regions residing in standalone target device functions.

The changes are:
- Shared memory is introduced for `omp.target` implicit allocations, such as those related to privatization and mapping, as long as they are shared across threads in a nested parallel region.
- Standalone target device functions are interpreted as being part of a Generic kernel, since the fact that they remain in the module after filtering means they must be reachable from a target region.
- Allocations whose only shared uses inside of an `omp.parallel` region are as part of a `private` clause are no longer moved to device shared memory.

This patch updates MLIR lowering of `fir.embox` and `fircg.ext_embox` operations to potentially use OpenMP device shared memory for the created descriptor when compiling for a target device. Any operation that introduces a stack allocation inside of a target or teams construct but outside of a parallel region, and then passes that value into a parallel region or to another function that might contain one, needs to use device shared memory instead for correctness when running on a GPU.

Additionally, the logic deciding whether to use device shared memory in place of stack allocations is updated to prefer shared memory when that memory is passed as an argument to a function.
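
As a hand-written C analogue of the descriptor case (the struct layout and names are illustrative stand-ins, not the actual Fortran descriptor; runtime prototypes assumed):

```c
// A descriptor-like structure is built by the team's main thread, outside
// the parallel region, and then read by all threads inside it, so it must
// live in device shared memory rather than on the main thread's stack.
#include <stddef.h>

extern void *__kmpc_alloc_shared(size_t size);   // declarations assumed
extern void __kmpc_free_shared(void *ptr, size_t size);

struct descriptor_like {   // stand-in for a Fortran array descriptor
  double *base;
  size_t extent;
};

void use_descriptor_in_parallel(double *data, size_t n) {
  struct descriptor_like *box = __kmpc_alloc_shared(sizeof(*box));
  box->base = data;
  box->extent = n;

  /* ... parallel region: every thread reads box->base / box->extent ... */

  __kmpc_free_shared(box, sizeof(*box));
}
```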
@skatrak skatrak requested review from dpalermo and mjklemm October 28, 2025 14:08
@skatrak skatrak force-pushed the amd/dev/safonsof/flang-generic branch from 0e329dd to 609eda1 on October 31, 2025 15:23