forked from llvm/llvm-project
[Flang] Support generic execution of parallel regions #414
Open: skatrak wants to merge 21 commits into amd-staging from amd/dev/safonsof/flang-generic
+2,821
−940
This patch removes the MLIR logic that attempts to identify Generic kernels that could be executed in SPMD mode. This optimization is done by the OpenMPOpt pass for Clang and is only required here to circumvent missing support for the new DeviceRTL APIs used in MLIR to LLVM IR translation that Clang doesn't currently use (e.g. `kmpc_distribute_static_loop`). Removing the checks in MLIR avoids duplicating logic that should be centralized in the OpenMPOpt pass.
Additionally, offloading kernels currently compiled through the OpenMP dialect fail to run parallel regions properly when in Generic mode. Disabling early detection makes this issue apparent for a range of kernels where it was previously masked by having them run in SPMD mode.
Update TargetRegionFlags to mirror OMPTgtExecModeFlags
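The relationship between the execution modes mentioned above can be sketched as a small bitmask. This is only an illustration: the enum below is modelled on the idea behind `OMPTgtExecModeFlags`, but its name, its numeric values, and the `classifyEarly` helper are assumptions made for this sketch, not the definitions used in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative execution-mode bitmask (values are assumptions for this
// sketch). Generic and SPMD are independent bits; Generic-SPMD is the
// combination of both.
enum ExecMode : std::uint8_t {
  Bare = 0,
  Generic = 1 << 0,
  SPMD = 1 << 1,
  GenericSPMD = Generic | SPMD,
};

// Hypothetical early classification: the frontend only tells Generic apart
// from SPMD. It never produces GenericSPMD, leaving any SPMD-ization of
// Generic kernels to a later optimization pass.
constexpr ExecMode classifyEarly(bool isFullySpmdLoopNest) {
  return isFullySpmdLoopNest ? SPMD : Generic;
}
```

With this shape, a kernel that is not recognized as SPMD is always tagged Generic, and the combined mode can only appear as the result of a later transformation.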
This patch adds the `__kmpc_alloc_shared` and `__kmpc_free_shared` DeviceRTL functions to the list of those the OMPIRBuilder is able to create.
This patch updates the allocation of some reduction and private variables within target regions to use device shared memory rather than private memory. This is a prerequisite to produce working Generic kernels containing parallel regions. In particular, the following situations result in the usage of device shared memory (only when compiling for the target device and only if they are placed inside of a target region representing a Generic kernel):
- Reduction variables on `teams` constructs.
- Private variables on `teams` and `distribute` constructs that are reduced or used inside of a `parallel` region.
There is currently no support for delayed privatization on `teams` constructs, so private variables on these constructs are not affected yet. When support is added, if it uses the existing `allocatePrivateVars` and `cleanupPrivateVars` functions, usage of device shared memory will be introduced automatically.
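The conditions listed above can be read as a single predicate. The following is a minimal sketch of that reading; all type and function names (`Construct`, `VarInfo`, `Context`, `usesDeviceSharedMem`) are hypothetical and do not correspond to code in the patch.

```cpp
#include <cassert>

// Hypothetical model of the construct that declares a variable.
enum class Construct { Teams, Distribute, Parallel, Other };

// Hypothetical description of a variable introduced by a clause.
struct VarInfo {
  Construct owner;              // construct whose clause declares it
  bool isReduction;             // declared through a reduction clause
  bool isPrivate;               // declared through a private clause
  bool reducedOrUsedInParallel; // reduced or used inside a `parallel` region
};

// Hypothetical compilation context.
struct Context {
  bool compilingForTargetDevice;
  bool insideGenericTargetRegion;
};

// Returns true when, under the rules described above, the variable would be
// placed in device shared memory instead of private memory.
bool usesDeviceSharedMem(const Context &ctx, const VarInfo &var) {
  // Shared memory only applies on the device, inside a Generic kernel.
  if (!ctx.compilingForTargetDevice || !ctx.insideGenericTargetRegion)
    return false;
  // Reduction variables on `teams` constructs.
  if (var.isReduction && var.owner == Construct::Teams)
    return true;
  // Private variables on `teams`/`distribute` that reach a parallel region.
  if (var.isPrivate && (var.owner == Construct::Teams ||
                        var.owner == Construct::Distribute))
    return var.reducedOrUsedInParallel;
  return false;
}
```

For example, a private variable on a `distribute` construct that is never touched inside a `parallel` region stays in private memory under this model.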
Argument structures are created when sections of the LLVM IR corresponding to an OpenMP construct are outlined into their own function. For this, stack allocations are used. This patch modifies this behavior when compiling for a target device and outlining `parallel`-related IR, so that it uses device shared memory instead of private stack space. This is needed in order for threads to have access to these arguments.
Address intermittent ICE triggered from the `OpenMPIRBuilder::finalize` method due to an invalid builder insertion point
Replace CodeExtractor callbacks with subclasses and simplify their creation based on OutlineInfo structures
This patch introduces codegen logic to produce a wrapper function argument for the `__kmpc_parallel_51` DeviceRTL function needed to handle arguments passed using device shared memory in Generic mode.
…unctions
This patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs. This change marks these functions as potentially containing parallel regions, which fixes a current bug with the state machine rewrite optimization. It previously failed to identify parallel regions located inside of the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing these parallel regions. As a result, Generic kernels produced by Flang that contain parallel regions now work properly.
One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this was due to assuming there was no parallel region. This patch removes that assumption, but the calls are now temporarily marked as unsupported in an SPMD context instead. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.
In this patch, some OMPIRBuilder codegen functions and callbacks are updated to work with arrays of deallocation insertion points. The purpose of this is to enable the replacement of `alloca`s with other types of allocations that require explicit deallocations in a way that makes it possible for `CodeExtractor` instances created during OMPIRBuilder finalization to also use them. The OpenMP to LLVM IR MLIR translation pass is updated to properly store and forward deallocation points together with their matching allocation point to the OMPIRBuilder.
Currently, only the `DeviceSharedMemCodeExtractor` uses this feature to get the `CodeExtractor` to use device shared memory for intermediate allocations when outlining a parallel region inside of a Generic kernel (a code path that is currently only used by Flang via MLIR). However, long term this might also be useful to refactor finalization of variables with destructors, potentially reducing the use of callbacks and simplifying privatization and reductions.
Instead of a single deallocation point, lists of those are used. This is to cover cases where there are multiple exit blocks originating from a single entry. If an allocation needing explicit deallocation is placed in the entry block of such cases, it would need to be deallocated before each of the exits.
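The need for a list of deallocation points, rather than a single one, can be illustrated with a toy control-flow model. This sketch does not use any LLVM API: `Block`, `placeAllocation`, and the `alloc_shared`/`free_shared` strings are hypothetical stand-ins chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy basic block: a name and a list of textual operations, where the last
// operation is assumed to be the block terminator.
struct Block {
  std::string name;
  std::vector<std::string> ops;
};

// Places one allocation at the start of the allocation block and one matching
// deallocation right before the terminator of every deallocation block.
void placeAllocation(std::vector<Block> &blocks, std::size_t allocBlock,
                     const std::vector<std::size_t> &deallocBlocks,
                     const std::string &var) {
  Block &entry = blocks[allocBlock];
  entry.ops.insert(entry.ops.begin(), "alloc_shared " + var);
  for (std::size_t idx : deallocBlocks) {
    Block &exitBlock = blocks[idx];
    assert(!exitBlock.ops.empty() && "exit block needs a terminator");
    // Insert just before the terminator so every exit path frees the memory.
    exitBlock.ops.insert(exitBlock.ops.end() - 1, "free_shared " + var);
  }
}
```

With an entry block that branches to two exit blocks, passing both exits as deallocation points yields one `alloc_shared` and two `free_shared` operations, one per exit path. A single deallocation point would leave one of the paths without a matching deallocation.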
This patch moves tablegen definitions that could be used for all kinds of heap allocations out of `omp.target_allocmem` and into a new `OpenMP_HeapAllocClause` that can be reused. Descriptions are updated to follow the format of most other operations and the custom verifier for `omp.target_allocmem` is removed as it only made a redundant check on its result type.
This patch introduces the `omp.alloc_shared_mem` and `omp.free_shared_mem` operations to represent explicit allocations and deallocations of shared memory across threads in a team, mirroring the existing `omp.target_allocmem` and `omp.target_freemem`. The `omp.alloc_shared_mem` op goes through the same Flang-specific transformations as `omp.target_allocmem`, so that the size of the buffer can be properly calculated when translating to LLVM IR. The corresponding runtime functions produced for these new operations are `__kmpc_alloc_shared` and `__kmpc_free_shared`, which previously could only be created for implicit allocations (e.g. privatized and reduction variables).
This patch introduces a new Flang OpenMP MLIR pass, only run for target device modules, that identifies `fir.alloca` operations that should use device shared memory and replaces them with pairs of `omp.alloc_shared_mem` and `omp.free_shared_mem` operations. This works in conjunction with the MLIR to LLVM IR translation pass's handling of privatization, mapping and reductions in the OpenMP dialect to select the right memory space for allocations based on where they are made and where they are used. This pass, in particular, handles explicit stack allocations in MLIR, whereas the aforementioned translation pass takes care of implicit ones represented by entry block arguments.
This patch refines checks to decide whether to use device shared memory or regular stack allocations. In particular, it adds support for parallel regions residing on standalone target device functions. The changes are:
- Shared memory is introduced for `omp.target` implicit allocations, such as those related to privatization and mapping, as long as they are shared across threads in a nested parallel region.
- Standalone target device functions are interpreted as being part of a Generic kernel, since the fact that they are present in the module after filtering means they must be reachable from a target region.
- Prevent allocations whose only shared uses inside of an `omp.parallel` region are as part of a `private` clause from being moved to device shared memory.
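One way to picture the refined checks is as a classification of each use of an allocation. The sketch below is a simplified model under stated assumptions: `UseKind`, `isGenericContext`, and `needsDeviceSharedMem` are hypothetical names, and the real pass inspects MLIR operations rather than a flat list of use kinds.

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification of a single use of an allocation.
enum class UseKind {
  OutsideParallel,       // used only by the thread that made the allocation
  InsideParallelBody,    // accessed by threads inside an `omp.parallel` body
  ParallelPrivateClause, // only named in a `private` clause of `omp.parallel`
};

// Standalone device functions are treated as part of a Generic kernel, since
// they must be reachable from some target region. Inside a target region, the
// kernel's own mode applies.
bool isGenericContext(bool insideTargetRegion, bool targetIsGeneric) {
  return insideTargetRegion ? targetIsGeneric : true;
}

// An allocation needs device shared memory only in a Generic context and only
// when some use actually shares it across threads. A use that merely appears
// in a `private` clause does not count as sharing in this model.
bool needsDeviceSharedMem(bool genericContext,
                          const std::vector<UseKind> &uses) {
  if (!genericContext)
    return false;
  for (UseKind use : uses)
    if (use == UseKind::InsideParallelBody)
      return true;
  return false;
}
```

Under this model, an allocation whose only appearance in a parallel region is through a `private` clause keeps its regular stack allocation.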
… variables deallocation
This patch updates MLIR lowering of `fir.embox` and `fircg.ext_embox` operations to potentially use OpenMP device shared memory for the created descriptor when compiling for a target device. Any operation that introduces a stack allocation inside of a target or teams construct but outside of a parallel region, and passes that value into a parallel region or to another function that might contain one, needs to use device shared memory instead for correctness when running on a GPU. Also, the logic deciding whether to use device shared memory in place of stack allocations is updated to choose shared memory when that memory is passed as an argument to a function.
…s from Flang to MLIR
0e329dd to 609eda1
This set of patches removes the early tagging of Generic-SPMD target regions from MLIR, so that MLIR only tells Generic apart from SPMD. This matches the behavior of Clang, which then relies on the OpenMPOpt pass to detect situations where Generic kernels can be executed in SPMD mode, potentially after certain transformations.
Merging this PR results in split distribute + parallel do kernels running in Generic mode, which might cause performance regressions in these cases. This is because the OpenMPOpt pass is currently not prepared to properly SPMDize Generic kernels containing new DeviceRTL loop functions that only Flang currently generates.
Before these changes, Generic mode is broken whenever a parallel region is reached. With them, it should be possible to execute parallel regions in Generic kernels properly.