@skatrak skatrak commented Oct 28, 2025

This set of patches removes the early tagging of Generic-SPMD target regions from MLIR to instead only tell apart Generic from SPMD. This matches the behavior of Clang, which then relies on the OpenMPOpt pass to detect situations where Generic kernels can be executed in SPMD mode, potentially after certain transformations.

Merging this PR results in split distribute + parallel do kernels running in Generic mode, which might cause performance regressions in those cases. This is because the OpenMPOpt pass is not yet prepared to properly SPMD-ize Generic kernels containing the new DeviceRTL loop functions, which only Flang currently generates.

Before these changes, Generic mode was broken whenever a parallel region was reached; with this PR, such regions should execute properly.
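
To illustrate, this is roughly the kernel shape affected, written as a hand-written C analogue (the PR itself targets Flang, where the equivalent would use `!$omp target teams distribute` and `!$omp parallel do`): a split `distribute` + `parallel do` nest that, after this change, starts out as a Generic kernel and is only SPMD-ized if OpenMPOpt can prove that safe.

```c
// C analogue of the affected Fortran pattern: "target teams distribute" over
// the outer loop, with a separate "parallel for" over the inner loop. With
// this PR the kernel is emitted in Generic mode; OpenMPOpt may later switch
// it to SPMD if it can prove that is safe.
#include <stdio.h>
#define N 256

int main(void) {
  static double a[N][N];
  #pragma omp target teams distribute map(tofrom: a)
  for (int i = 0; i < N; ++i) {
    #pragma omp parallel for
    for (int j = 0; j < N; ++j)
      a[i][j] = (double)(i + j);
  }
  printf("%f\n", a[N - 1][N - 1]);
  return 0;
}
```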

This patch removes the logic in MLIR that attempted to identify Generic kernels that could be executed in SPMD mode.

For Clang, this optimization is performed by the OpenMPOpt pass; the check was only needed here to work around OpenMPOpt's missing support for the new DeviceRTL APIs used by MLIR to LLVM IR translation that Clang doesn't currently emit (e.g. `__kmpc_distribute_static_loop`). Removing these checks from MLIR avoids duplicating logic that should be centralized in the OpenMPOpt pass.

Additionally, offloading kernels currently compiled through the OpenMP dialect fail to run parallel regions properly when in Generic mode. Disabling early detection makes this issue apparent for a range of kernels where it was previously masked by running them in SPMD mode.

Update TargetRegionFlags to mirror OMPTgtExecModeFlags

This patch adds the `__kmpc_alloc_shared` and `__kmpc_free_shared` DeviceRTL functions to the list of those the OMPIRBuilder is able to create.

This patch updates the allocation of some reduction and private variables within target regions to use device shared memory rather than private memory. This is a prerequisite for producing working Generic kernels containing parallel regions.

In particular, the following situations result in the use of device shared memory (only when compiling for the target device, and only if they appear inside of a target region representing a Generic kernel):
- Reduction variables on `teams` constructs.
- Private variables on `teams` and `distribute` constructs that are reduced or
used inside of a `parallel` region.

Currently, there is no support for delayed privatization on `teams` constructs,
so private variables on these constructs won't currently be affected. When
support is added, if it uses the existing `allocatePrivateVars` and
`cleanupPrivateVars` functions, usage of device shared memory will be
introduced automatically.
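
As a rough illustration of the situations listed above, the intended device-side effect is sketched below in hand-written C. This is not the actual generated IR, and the `__kmpc_alloc_shared`/`__kmpc_free_shared` prototypes are written from memory; the DeviceRTL headers are authoritative. It compiles as a translation unit, but linking would require the DeviceRTL.

```c
// Hand-written sketch of the intended codegen, not the real IR. A variable
// that lives at teams/distribute level but is read or reduced inside a
// parallel region moves from a private stack slot to device shared memory,
// so worker threads can see it in Generic mode.
#include <stddef.h>

// DeviceRTL entry points (declarations assumed; see the DeviceRTL headers
// for the authoritative prototypes).
extern void *__kmpc_alloc_shared(size_t size);
extern void __kmpc_free_shared(void *ptr, size_t size);

void generic_kernel_teams_body(void) {
  // Before: "double red;" on the main thread's stack, which other threads
  // of the team cannot access once a parallel region starts.
  double *red = __kmpc_alloc_shared(sizeof(double));
  *red = 0.0;

  /* ... parallel region runs here; every thread can load/store *red ... */

  __kmpc_free_shared(red, sizeof(double));
}
```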
Argument structures are created when sections of the LLVM IR corresponding to
an OpenMP construct are outlined into their own function. For this, stack
allocations are used.

This patch modifies this behavior when compiling for a target device and outlining `parallel`-related IR, so that the argument structure uses device shared memory instead of private stack space. This is needed so that all threads have access to these arguments.
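
Conceptually, the change looks like the following hand-written C sketch; names such as `captured_args` are illustrative, not the identifiers the OMPIRBuilder emits, and the runtime prototypes are assumed.

```c
// Simplified sketch of why the outlining argument structure must live in
// device shared memory: the team's main thread fills it in before the
// parallel region, and every worker thread reads it afterwards.
#include <stddef.h>

extern void *__kmpc_alloc_shared(size_t size);   // declarations assumed
extern void __kmpc_free_shared(void *ptr, size_t size);

struct captured_args {   // one field per value captured by the region
  double *a;
  int n;
};

static void parallel_region_outlined(struct captured_args *args) {
  for (int i = 0; i < args->n; ++i)
    args->a[i] = 0.0;     // stand-in for the real region body
}

void emit_parallel_call(double *a, int n) {
  // A stack alloca here would be private to the main thread; shared memory
  // makes the structure visible to all threads launched for the region.
  struct captured_args *args = __kmpc_alloc_shared(sizeof(*args));
  args->a = a;
  args->n = n;
  /* ... __kmpc_parallel_51(...) would be called here with args ... */
  parallel_region_outlined(args);
  __kmpc_free_shared(args, sizeof(*args));
}
```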

Address intermittent ICE triggered from the `OpenMPIRBuilder::finalize` method due to an invalid builder insertion point

Replace CodeExtractor callbacks with subclasses and simplify their creation based on OutlineInfo structures

This patch introduces codegen logic to produce a wrapper function argument for the `__kmpc_parallel_51` DeviceRTL function, needed to handle arguments passed using device shared memory in Generic mode.
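
The sketch below shows, in simplified hand-written C, the role this wrapper plays. The prototypes and the wrapper parameter meanings are assumptions modelled on the existing Clang GPU codegen (e.g. the `ident` parameter is reduced to `void *`); see DeviceRTL for the real declarations.

```c
// Worker threads do not receive the argument pointer directly; they call the
// wrapper, which fetches the shared argument block from the runtime and
// forwards it to the outlined parallel body.
#include <stdint.h>

extern void __kmpc_get_shared_variables(void ***args);            // assumed
extern void __kmpc_parallel_51(void *ident, int32_t gtid, int32_t if_expr,
                               int32_t num_threads, int32_t proc_bind,
                               void *outlined_fn, void *wrapper_fn,
                               void **args, int64_t nargs);       // simplified

// Outlined body of the parallel region (stand-in for the real code).
static void parallel_body_outlined(int32_t *gtid, int32_t *btid, double *x) {
  (void)gtid;
  (void)btid;
  *x += 1.0;
}

// Wrapper passed as the wrapper_fn argument of __kmpc_parallel_51.
void parallel_body_wrapper(int16_t unused, int32_t thread_id) {
  (void)unused;
  void **shared_args = 0;
  __kmpc_get_shared_variables(&shared_args);
  int32_t zero = 0;
  parallel_body_outlined(&thread_id, &zero, (double *)shared_args[0]);
}
```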
…unctions

This patch updates the OpenMP optimization pass to know about the new DeviceRTL
functions for loop constructs.

This change marks these functions as potentially containing parallel regions,
which fixes a current bug with the state machine rewrite optimization. It
previously failed to identify parallel regions located inside of the callbacks
passed to these new DeviceRTL functions, causing the resulting code to skip
executing these parallel regions.

As a result, Generic kernels produced by Flang that contain parallel regions
now work properly.
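
The pattern OpenMPOpt has to understand is sketched below in C, using a hypothetical, deliberately simplified stand-in for the `__kmpc_distribute_static_loop_*` family (the real DeviceRTL prototypes take additional parameters).

```c
// With the new DeviceRTL loop entry points, the loop body is passed as a
// function pointer, so a nested "parallel" region is only reachable through
// that indirect call.
#include <stdint.h>

// Hypothetical stand-in for __kmpc_distribute_static_loop_*; not the real
// prototype.
extern void distribute_static_loop_sketch(
    void (*loop_body)(int32_t iv, void *arg), void *arg, int32_t trip_count);

static void loop_body_callback(int32_t i, void *arg) {
  (void)i;
  (void)arg;
  /* ... a "parallel" region may be emitted here, inside the callback ... */
}

void lower_distribute(void *arg, int32_t n) {
  // The state-machine rewrite in OpenMPOpt previously did not look through
  // this call, so a parallel region inside the callback was never executed.
  distribute_static_loop_sketch(loop_body_callback, arg, n);
}
```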

One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this happened because these calls were assumed to contain no parallel region; that assumption is removed by this patch, but the functions are now temporarily marked as unsupported in an SPMD context instead. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.

In this patch, some OMPIRBuilder codegen functions and callbacks are updated to work with arrays of deallocation insertion points. The purpose is to enable replacing `alloca`s with other kinds of allocations that require explicit deallocation, in a way that also lets `CodeExtractor` instances created during OMPIRBuilder finalization use them.

The OpenMP to LLVM IR MLIR translation pass is updated to properly store and
forward deallocation points together with their matching allocation point to
the OMPIRBuilder.

Currently, only the `DeviceSharedMemCodeExtractor` uses this feature, to make the `CodeExtractor` use device shared memory for intermediate allocations when outlining a parallel region inside of a Generic kernel (a code path currently only used by Flang via MLIR). However, in the long term this might also be useful for refactoring the finalization of variables with destructors, potentially reducing the use of callbacks and simplifying privatization and reductions.

Instead of a single deallocation point, lists of them are used. This covers cases where multiple exit blocks originate from a single entry block: if an allocation needing explicit deallocation is placed in the entry block, it must be deallocated before each of the exits.
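
A minimal C sketch of why a single deallocation point is not enough (runtime prototypes assumed):

```c
// An allocation made in the entry block that requires an explicit
// deallocation must be released on every path out of the region.
#include <stddef.h>

extern void *__kmpc_alloc_shared(size_t size);   // declarations assumed
extern void __kmpc_free_shared(void *ptr, size_t size);

int region_with_two_exits(int cond) {
  double *tmp = __kmpc_alloc_shared(sizeof(double));  // allocation point
  *tmp = 1.0;
  if (cond) {
    __kmpc_free_shared(tmp, sizeof(double));  // deallocation point #1
    return 1;                                 // first exit block
  }
  *tmp *= 2.0;
  __kmpc_free_shared(tmp, sizeof(double));    // deallocation point #2
  return 0;                                   // second exit block
}
```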
This patch moves tablegen definitions that could be used for all kinds of heap
allocations out of `omp.target_allocmem` and into a new
`OpenMP_HeapAllocClause` that can be reused.

Descriptions are updated to follow the format of most other operations and the
custom verifier for `omp.target_allocmem` is removed as it only made a
redundant check on its result type.

This patch introduces the `omp.alloc_shared_mem` and `omp.free_shared_mem`
operations to represent explicit allocations and deallocations of shared memory
across threads in a team, mirroring the existing `omp.target_allocmem` and
`omp.target_freemem`.

The `omp.alloc_shared_mem` op goes through the same Flang-specific
transformations as `omp.target_allocmem`, so that the size of the buffer can be
properly calculated when translating to LLVM IR.

The corresponding runtime functions produced for these new operations are
`__kmpc_alloc_shared` and `__kmpc_free_shared`, which previously could only be
created for implicit allocations (e.g. privatized and reduction variables).

This patch introduces a new Flang OpenMP MLIR pass, run only for target device
modules, that identifies `fir.alloca` operations that should use device shared
memory and replaces them with pairs of `omp.alloc_shared_mem` and
`omp.free_shared_mem` operations.

This works in conjunction with the MLIR to LLVM IR translation pass's handling of
privatization, mapping and reductions in the OpenMP dialect to properly select
the right memory space for allocations based on where they are made and where
they are used.

This pass, in particular, handles explicit stack allocations in MLIR, whereas
the aforementioned translation pass takes care of implicit ones represented by
entry block arguments.

This patch refines the checks that decide whether to use device shared memory or regular stack allocations. In particular, it adds support for parallel regions residing in standalone target device functions.

The changes are:
- Shared memory is introduced for `omp.target` implicit allocations, such as those related to privatization and mapping, as long as they are shared across threads in a nested parallel region.
- Standalone target device functions are interpreted as being part of a Generic kernel, since the fact that they remain in the module after filtering means they must be reachable from a target region.
- Allocations whose only shared uses inside of an `omp.parallel` region are as part of a `private` clause are no longer moved to device shared memory.

This patch updates MLIR lowering of `fir.embox` and `fircg.ext_embox` operations to potentially use OpenMP device shared memory for the created descriptor when compiling for a target device. Any operation that introduces a stack allocation inside of a target or teams construct but outside of a parallel region, and then passes that value into a parallel region or to another function that might contain one, needs to use device shared memory instead for correctness when running on a GPU.

Additionally, the logic deciding whether to use device shared memory in place of stack allocations is updated to prefer shared memory when that memory is passed as an argument to a function.
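
As a hand-written C analogue of the descriptor case (the struct layout and names are illustrative stand-ins, not the actual Fortran descriptor; runtime prototypes assumed):

```c
// A descriptor-like structure is built by the team's main thread, outside
// the parallel region, and then read by all threads inside it, so it must
// live in device shared memory rather than on the main thread's stack.
#include <stddef.h>

extern void *__kmpc_alloc_shared(size_t size);   // declarations assumed
extern void __kmpc_free_shared(void *ptr, size_t size);

struct descriptor_like {   // stand-in for a Fortran array descriptor
  double *base;
  size_t extent;
};

void use_descriptor_in_parallel(double *data, size_t n) {
  struct descriptor_like *box = __kmpc_alloc_shared(sizeof(*box));
  box->base = data;
  box->extent = n;

  /* ... parallel region: every thread reads box->base / box->extent ... */

  __kmpc_free_shared(box, sizeof(*box));
}
```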
@skatrak skatrak requested review from dpalermo and mjklemm October 28, 2025 14:08
@skatrak skatrak force-pushed the amd/dev/safonsof/flang-generic branch from 0e329dd to 609eda1 on October 31, 2025 15:23