Feature/merge upstream 20231019 #287

kaz7 · 2024-01-24T02:08:35Z

Merge upstream's main up to 2023/10/19.

This passed the internal regression tests already.

…oadcastable shuffle mask

This patch adds the CodeGen changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. This change relaxes restrictions on what gets emitted on the device path, when compiling in `hipstdpar` mode: 1. Unless a function is explicitly marked `__host__`, it will get emitted, whereas before only `__device__` and `__global__` functions would be emitted; 2. Unsupported builtins are ignored as opposed to being marked as an error, as the decision on their validity is deferred to the `hipstdpar` specific code selection pass; 3. We add a `hipstdpar` specific pass to the opt pipeline, independent of optimisation level: - When compiling for the host, iff the user requested it via the `--hipstdpar-interpose-alloc` flag, we add a pass which replaces canonical allocation / deallocation functions with accelerator aware equivalents. A test to validate that unannotated functions get correctly emitted is added as well. Reviewed by: yaxunl, efriedma Differential Revision: https://reviews.llvm.org/D155850

@p

This PR enhances `BasicPtxBuilder` to support predicates in PTX code generation. The `BasicPtxBuilder` interface was initially introduced for generating PTX code automatically for Ops that aren't supported by LLVM core. Predicates, which are typically not supported in LLVM core, are now supported using the same mechanism. In PTX programming, instructions can be guarded by predicates as shown below:. Here `@p` is a predicate register and guard the execution of the instruction. ``` @p ptx.code op1, op2, op3 ``` This PR introduces the `getPredicate` function in the `BasicPtxBuilder` interface to set an optional predicate. When a predicate is provided, the instruction is generated with predicate and guarded, otherwise, predicate is not genearted. Note that the predicate value must always appear as the last argument on the Op definition. Additionally, this PR implements predicate usage for the following ops: - mbarrier.init - mbarrier.init.shared - mbarrier.arrive.expect_tx - mbarrier.arrive.expect_tx.shared - cp.async.bulk.tensor.shared.cluster.global - cp.async.bulk.tensor.global.shared.cta See for more detail in PTX programing model https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions

The #68728 significantly simplified the accumulator matrix type, making it easier to work with the nvgpu dialect without worrying about the number of required structs, as this information is abstracted away in the nvgpu-to-nvvm transformation. However, we forgot packing the structs after initialization, causing the accumulator matrix to hold undefined values, which is wrong. This PR addresses that.

Fixes #69291. This patch improve the logic handling different patterns to avoid mixing these pattern.

This PR adds `prefetch.tensormap` Op. It brings the cache line containing the given tma descriptor for subsequent use by the cp.async.bulk.tensor instruction. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-prefetch-prefetchu

Add a test that performs warpgroup matrix multiply 128x128x64. The test uses three Ops to do that.

We cannot rely on the address of StringAttr being the same if the stored string is the same.

…s. (#69243) isGDS() and isTFE() need special treatment, because they may be both named-bit and token operands. Part of #62629.

…(#69326) Thsi removes some of the machinery added by D85268, which was unused since D87719 changed all buffer atomic intrinsics to return a value.

It's supposed to return null when an unknown target id is passed.

Currently clang diagnoses the following code: (https://godbolt.org/z/s8zK3E5P5) but nvcc does not. ` struct A { constexpr A(){} }; struct B { A a; int b; }; template<typename T> __global__ void kernel( ) { __shared__ B x; } ` Clang generates an implicit trivial ctor for struct B, which should be allowed for initializing a shared variable. However, the body of the ctor is defined only if the template kernel is instantiated. Clang checks the initialization of variable in non-instantiated templates, where it cannot find the body of the ctor, therefore diagnoses it. This patch skips the check for non-instantiated templates.

As described in: ARM-software/acle#257 Patch by : David Sherwood <[email protected]> Reviewed By: sdesmalen Differential Revision: https://reviews.llvm.org/D150961

…g-info ...behind an experimental CMAKE option that's off by default. This patch adds a new ilist-iterator-like class that can carry two extra bits as well as the usual node pointer. This is part of the project to remove debug-intrinsics from LLVM: see the rationale here [0], they're needed to signal whether a "position" in a BasicBlock includes any debug-info before or after the iterator. This entirely duplicates ilist_iterator, attempting re-use showed it to be a false economy. It's enable-able through the existing ilist_node options interface, hence a few sites where the instruction-list type needs to be updated. The actual main feature, the extra bits in the class, aren't part of the class unless the cmake flag is given: this is because there's a compile-time cost associated with it, and I'd like to get everything in-tree but off-by-default so that we can do proper comparisons. Nothing actually makes use of this yet, but will do soon, see the Phab patch stack. [0] https://discourse.llvm.org/t/rfc-instruction-api-changes-needed-to-eliminate-debug-intrinsics-from-ir/68939 Differential Revision: https://reviews.llvm.org/D153777

Fixes build failures for cases where there's no additional visibility / linkage spec. Differential Revision: https://reviews.llvm.org/D155850

Modify ConstantBounds' methods that handle subscripts and bounds to avoid integer overflows. This is needed to properly handle arrays with the maximum possible upper bound (INT64_MAX).

This updates the documentation on these inline asm constraint codes to match reality. Context: llvm/llvm-project#68818 (comment) Note: dropping also the `'o'` from the docs because I can't find any mention of it in X86ISelLowering.cpp.

Similar to FP16 but we don't have native scalar instruction support, so limit it to vector types only. Fixes #68149

…str (#68908) The svldr_vnum_za and svstr_vnum_za builtins/intrinsics currently require that the vnum argument be an immediate, but since vnum is used to modify the base register via a mul and add, that restriction is not necessary. This patch removes that restriction.

…67727) Ensure that the f18Addendum flag is preserved in AllocatableApplyMold(), that raw().type is reinitialized in AllocatableDeallocatePolymorphic(), and that the implementations of SameTypeAs() and ExtendsTypeOf() handle unallocated unlimited polymorphic arguments correctly.

…69129) Mark any registers as CustomReg and any stack slot as CustomMem. This allows us to more directly emit the register or memory access for the high part. Previously we needed a memory access if the low register was X17 and we assumed the stack offset was 0. If the low part wasn't X17, we assumed the high register was the next register after the low register. This is another part of supporting FP arguments with GISel.

…plicit N1/N2 ops First step towards adding unary/ternary fp ops handling, and not just binops

…314) replaceValuesPerBlockEntry() only handled simple and coerced load values, however the load may also be referenced by a select value. Additionally, I suspect that the previous code might have been incorrect if a load had an offset, as it always constructed the AvailableValue from scratch. Fixes llvm/llvm-project#69301.

Fixes llvm/llvm-project#69369. Fixes clangd/clangd#1700.

The keyword is intended for debugging purpose. It prints a message to stderr. This patch is based on code originally written by Adam Nemet, and on the feedback received by the reviewers in https://reviews.llvm.org/D157492.

This patch fixes: compiler-rt/lib/builtins/cpu_model.c:590:5: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] by adding a missing "break;".

* `dump`, added in llvm/llvm-project#68793 * `!repr`, added in llvm/llvm-project#68716

…#68897) * `dump`, added in llvm/llvm-project#68793 * `!repr`, added in llvm/llvm-project#68716 The keyword `assert` was missing, so I have added that too.

…d x, c)) with Zicond. (#69563) It's only beneficial when cond is setcc with integer equality condition code. For other case, it has same instruction count as the original.

This PR adds `nvvm.stmatrix` Op to NVVM dialect. The Op collectively store one or more matrices across all threads in a warp to the given address location in shared memory.

Add tests with argmem variations.

…tor (#69010) Adds a new `__builtin_vectorelements()` function which returns the number of elements for a given vector either at compile-time for fixed-sized vectors, e.g., created via `__attribute__((vector_size(N)))` or at runtime via a call to `@llvm.vscale.i32()` for scalable vectors, e.g., SVE or RISCV V. The new builtin follows a similar path as `sizeof()`, as it essentially does the same thing but for the number of elements in vector instead of the number of bytes. This allows us to re-use a lot of the existing logic to handle types etc. A small side addition is `Type::isSizelessVectorType()`, which we need to distinguish between sizeless vectors (SVE, RISCV V) and sizeless types (WASM). This is the [corresponding discussion](https://discourse.llvm.org/t/new-builtin-function-to-get-number-of-lanes-in-simd-vectors/73911).

…s. (#69329) A recent commit (#69190) broke the bazel builds. Turns out that Bazel uses symlinks for providing the test files, which the path expansion of the module loading mechanism did not handle correctly. This PR fixes that. It also reorganizes the tests better: It puts all `.mlir` files that are included by some other test into a common `include` folder. This greatly simplifies the definition of the dependencies between the different `.mlir` files in Bazel's `BUILD` file. The commit also adds a comment to all included files why these aren't tested themselves direclty and uses the `%{fs-sep}` expansion for paths more consistently. Finally, it uncomments all but one of the tests excluded in Bazel because they seem to run now. (The remaining one includes a file that it itself a test, so it would have to live *in* and *outside* of the `include` folder.)

…68962) The _mm_cmpistri instruction can be used to quickly parse identifiers. With this patch activated, clang pre-processes <iostream> 1.8% faster, and sqlite3.c amalgametion 1.5% faster, based on time measurements and number of executed instructions as measured by valgrind. The introduction of an extra helper function in the regular case has no impact on performance, see https://llvm-compile-time-tracker.com/compare.php?from=30240e428f0ec7d4a6d1b84f9f807ce12b46cfd1&to=12bcb016cde4579ca7b75397762098c03eb4f264&stat=instructions:u --------- Co-authored-by: serge-sans-paille <[email protected]>

…ize tests. (#69329)" This reverts commit f681225. That commit changed the organization of the tests of the transform dialect interpreter but did not take into account some tests that were added in the meantime.

As described in: ARM-software/acle#257 Patch by : Sander de Smalen<[email protected]> Reviewed By: dtemirbulatov Differential Revision: https://reviews.llvm.org/D151199

…nize tests. (#69329)" This reverts commit c122b97 but fixes tests that were added between submitting #69329 for review and landing it for the first time.

Building helloworld.c currently errors with "undefined symbol: __llvm_libc_syscall" See: llvm/llvm-project#67032

…variables" This reverts commit 3353f7d. Fixed test bug (unspecified order of arg evaluation)

…fault_mem_order REQUIRES clause This patch creates the `OmpRewriteMutator` pass that runs at the end of `RewriteParseTree()`. This pass is intended to make OpenMP-specific mutations to the PFT after name resolution. In the case of the `atomic_default_mem_order` clause of the REQUIRES directive, name resolution results in populating global symbols with information about the REQUIRES clauses that apply to that scope. The new rewrite pass is then able to use this information in order to explicitly set the memory order of ATOMIC constructs for which that is not already specified. Given that this rewrite happens before semantics checks, the check of the order in which ATOMIC constructs without explicit memory order and REQUIRES directives with `atomic_default_mem_order` appear is moved earlier into the rewrite pass. Otherwise, these problems would not be caught by semantics checks, since the PFT would be modified by that stage. This is patch 4/5 of a series splitting D149337 to simplify review. Depends on D157983. Differential Revision: https://reviews.llvm.org/D158096

To fix that ticket we only needed to address the V_LSHLREV_B16 case, but I did it for all insts just in case. Fixes #66899

Small fix for failing tests after merge of #69010. The tests need `REQUIRES` to ensure that the correct headers are available. I've also added a generic x86 build which does not need headers, so there is at least one run per test.

When the FPU was selected with "+(no)fp(.dp)" extensions in "-march" or "-mcpu" options, the FPU used for multilib selection was still the default one for given architecture or CPU.

We already save the information about signedness ourselves.

As described in: ARM-software/acle#257 Patch by : David Sherwood <[email protected]> Reviewed By: kmclaughlin Differential Revision: https://reviews.llvm.org/D151307

Similar to what we already do for add/sub + saturation variants. Scalar support will be added in a future patch covering the other variants at the same time. Alive2: https://alive2.llvm.org/ce/z/rBDrNE Fixes #69080

…9539) The immediate legality checks are now embedded into the isOperandLegal(). It is not needed to check it again.

…merge-upstream-20231019

RKSimon and others added 30 commits October 17, 2023 11:34

[X86] vselect.ll - add vXi8 select-by-constant tests with repeated/br…

be9bc54

…oadcastable shuffle mask

[InstCombine] Don't mix X << Y / Z << Y with X << Y / X << Z (#69302)

f2898de

Fixes #69291. This patch improve the logic handling different patterns to avoid mixing these pattern.

[MLIR][NVGPU] Test warpgroup matrix multiply 128x128x64 (#68817)

b736e04

Add a test that performs warpgroup matrix multiply 128x128x64. The test uses three Ops to do that.

[mlir][Tosa] Fix test failure when running with Asan.

7f2dd2d

We cannot rely on the address of StringAttr being the same if the stored string is the same.

[gn] port 3694697

e730f4a

[gn] port 3694697

8b5625c

[Bazel] disable preload-library.mlir test

4434253

[Bazel] fix typo

9057608

[Bazel] Fix dependencies for clang codegen

12bf423

[AMDGPU][AsmParser] Eliminate custom predicates for named-bit operand…

509b570

…s. (#69243) isGDS() and isTFE() need special treatment, because they may be both named-bit and token operands. Part of #62629.

[AMDGPU] Remove support for no-return buffer atomic intrinsics. NFC. …

bff17f9

…(#69326) Thsi removes some of the machinery added by D85268, which was unused since D87719 changed all buffer atomic intrinsics to return a value.

[TargetParser][AMDGPU] Fix getArchEntry(). (#69222)

096eba1

It's supposed to return null when an unknown target id is passed.

[Clang][SVE2.1] Add svcntp prototype

81d8fa5

As described in: ARM-software/acle#257 Patch by : David Sherwood <[email protected]> Reviewed By: sdesmalen Differential Revision: https://reviews.llvm.org/D150961

[LoongArch] Precommit a test for atomic cmpxchg optmization

b2773d1

[HIP][Clang][CodeGen] Simplify test for hipstdpar

791b890

Fixes build failures for cases where there's no additional visibility / linkage spec. Differential Revision: https://reviews.llvm.org/D155850

[flang] Fix constant subscript operations (#68352)

c385981

Modify ConstantBounds' methods that handle subscripts and bounds to avoid integer overflows. This is needed to properly handle arrays with the maximum possible upper bound (INT64_MAX).

[gn build] Port 088d272

4df46c3

[X86] Enable bfloat type support in inline assembly constraints (#68469)

3d6e416

Similar to FP16 but we don't have native scalar instruction support, so limit it to vector types only. Fixes #68149

[AMDGPU] Simplify definition of SIbuffer_atomic_*. NFC.

a559de0

[DAG] foldConstantFPMath - accept ArrayRef<SDValue> Ops instead of ex…

2f329d8

…plicit N1/N2 ops First step towards adding unary/ternary fp ops handling, and not just binops

nikic and others added 28 commits October 19, 2023 09:08

[X86] Support -march=pantherlake,clearwaterforest (#69277)

278e533

[clangd] Disable crashy unchecked-optional-access tidy check (#69427)

e63ab13

Fixes llvm/llvm-project#69369. Fixes clangd/clangd#1700.

[Tablegen] Add keyword dump. (#68793)

db9b6f4

The keyword is intended for debugging purpose. It prints a message to stderr. This patch is based on code originally written by Adam Nemet, and on the feedback received by the reviewers in https://reviews.llvm.org/D157492.

[compiler-rt] Fix a warning

4d8ea66

This patch fixes: compiler-rt/lib/builtins/cpu_model.c:590:5: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] by adding a missing "break;".

[ReleaseNotes][TableGen] Add dump and !repr. (#68893)

411c4ed

* `dump`, added in llvm/llvm-project#68793 * `!repr`, added in llvm/llvm-project#68716

[TableGen] Update editor modes for new keywords and bang operators. (…

4168845

…#68897) * `dump`, added in llvm/llvm-project#68793 * `!repr`, added in llvm/llvm-project#68716 The keyword `assert` was missing, so I have added that too.

[RISCV] Combine (and (select cond, x, -1), c) to (select cond, x, (an…

5341d54

…d x, c)) with Zicond. (#69563) It's only beneficial when cond is setcc with integer equality condition code. For other case, it has same instruction count as the original.

[mlir][nvvm] Introduce nvvm.stmatrix Op (#69467)

5316d19

This PR adds `nvvm.stmatrix` Op to NVVM dialect. The Op collectively store one or more matrices across all threads in a warp to the given address location in shared memory.

[FunctionAttrs] Add additional tests for writeonly (NFC)

9ea2fd2

Add tests with argmem variations.

[Clang][SVE2.1] Add pfalse builtin

ba47bc7

As described in: ARM-software/acle#257 Patch by : Sander de Smalen<[email protected]> Reviewed By: dtemirbulatov Differential Revision: https://reviews.llvm.org/D151199

Reapply "[mlir][transform] Support symlinks in module loading. Reorga…

3517b67

…nize tests. (#69329)" This reverts commit c122b97 but fixes tests that were added between submitting #69329 for review and landing it for the first time.

[libc] Fix accidental LIBC_NAMESPACE_syscall definition (#69548)

d404130

Building helloworld.c currently errors with "undefined symbol: __llvm_libc_syscall" See: llvm/llvm-project#67032

Reapply "[dataflow] use true/false literals in formulas, rather than …

7338eb5

…variables" This reverts commit 3353f7d. Fixed test bug (unspecified order of arg evaluation)

[clang][Interp][NFC] Add more tests for bitfield initializers

2b97fe2

[AMDGPU] PeepholeSDWA: Don't assume inst srcs are registers (#69576)

d2edff8

To fix that ticket we only needed to address the V_LSHLREV_B16 case, but I did it for all insts just in case. Fixes #66899

[ARM] fix "+fp.dp" in multilib selection (#67412)

b3c4f64

When the FPU was selected with "+(no)fp(.dp)" extensions in "-march" or "-mcpu" options, the FPU used for multilib selection was still the default one for given architecture or CPU.

[clang][Interp][NFC] Use an APInt instead of APSint

0e1d723

We already save the information about signedness ourselves.

[Clang][SVE2.1] Add svwhile (predicate-as-counter) builtins

4ed0dfe

As described in: ARM-software/acle#257 Patch by : David Sherwood <[email protected]> Reviewed By: kmclaughlin Differential Revision: https://reviews.llvm.org/D151307

[DAG] Expand vXi1 add/sub overflow operations as xor/and (#69191)

c43ac32

Similar to what we already do for add/sub + saturation variants. Scalar support will be added in a future patch covering the other variants at the same time. Alive2: https://alive2.llvm.org/ce/z/rBDrNE Fixes #69080

[AMDGPU] Remove legality checks from imm folding in shrink. NFCI. (#6…

906d3ff

…9539) The immediate legality checks are now embedded into the isOperandLegal(). It is not needed to check it again.

Merge commit '906d3ff054b0ca6439521e8871bac54843cc73a2' into feature/…

f35f990

…merge-upstream-20231019

kaz7 merged commit bbf4055 into develop Jan 24, 2024
3 of 4 checks passed

kaz7 deleted the feature/merge-upstream-20231019 branch January 24, 2024 02:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/merge upstream 20231019 #287

Feature/merge upstream 20231019 #287

kaz7 commented Jan 24, 2024

Feature/merge upstream 20231019 #287

Feature/merge upstream 20231019 #287

Conversation

kaz7 commented Jan 24, 2024