Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/merge upstream 20231019 #287

Merged
merged 1,066 commits into from
Jan 24, 2024
Merged

Feature/merge upstream 20231019 #287

merged 1,066 commits into from
Jan 24, 2024

Conversation

kaz7
Copy link
Collaborator

@kaz7 kaz7 commented Jan 24, 2024

Merge upstream's main up to 2023/10/19.

This passed the internal regression tests already.

RKSimon and others added 30 commits October 17, 2023 11:34
This patch adds the CodeGen changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. This change relaxes restrictions on what gets emitted on the device path, when compiling in `hipstdpar` mode:

1. Unless a function is explicitly marked `__host__`, it will get emitted, whereas before only `__device__` and `__global__` functions would be emitted;
2. Unsupported builtins are ignored as opposed to being marked as an error, as the decision on their validity is deferred to the `hipstdpar` specific code selection pass;
3. We add a `hipstdpar` specific pass to the opt pipeline, independent of optimisation level:
    - When compiling for the host, iff the user requested it via the `--hipstdpar-interpose-alloc` flag, we add a pass which replaces canonical allocation / deallocation functions with accelerator aware equivalents.

A test to validate that unannotated functions get correctly emitted is added as well.

Reviewed by: yaxunl, efriedma

Differential Revision: https://reviews.llvm.org/D155850
This PR enhances `BasicPtxBuilder` to support predicates in PTX code
generation. The `BasicPtxBuilder` interface was initially introduced for
generating PTX code automatically for Ops that aren't supported by LLVM
core. Predicates, which are typically not supported in LLVM core, are
now supported using the same mechanism.

In PTX programming, instructions can be guarded by predicates as shown
below:. Here `@p` is a predicate register and guard the execution of the
instruction.

```
@p ptx.code op1, op2, op3
```

This PR introduces the `getPredicate` function in the `BasicPtxBuilder`
interface to set an optional predicate. When a predicate is provided,
the instruction is generated with predicate and guarded, otherwise,
predicate is not genearted. Note that the predicate value must always
appear as the last argument on the Op definition.

Additionally, this PR implements predicate usage for the following ops:

- mbarrier.init
- mbarrier.init.shared
- mbarrier.arrive.expect_tx
- mbarrier.arrive.expect_tx.shared
- cp.async.bulk.tensor.shared.cluster.global
- cp.async.bulk.tensor.global.shared.cta

See for more detail in PTX programing model

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
The #68728 significantly simplified the accumulator matrix type, making
it easier to work with the nvgpu dialect without worrying about the
number of required structs, as this information is abstracted away in
the nvgpu-to-nvvm transformation.

However, we forgot packing the structs after initialization, causing the
accumulator matrix to hold undefined values, which is wrong. This PR
addresses that.
Fixes #69291.
This patch improve the logic handling different patterns to avoid mixing these
pattern.
This PR adds `prefetch.tensormap` Op. It brings the cache line
containing the given tma descriptor for subsequent use by the
cp.async.bulk.tensor instruction.


https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-prefetch-prefetchu
Add a test that performs warpgroup matrix multiply 128x128x64. The test
uses three Ops to do that.
We cannot rely on the address of StringAttr being the same if the stored
string is the same.
…s. (#69243)

isGDS() and isTFE() need special treatment, because they may be both
named-bit and token operands.

Part of #62629.
…(#69326)

Thsi removes some of the machinery added by D85268, which was unused
since D87719 changed all buffer atomic intrinsics to return a value.
It's supposed to return null when an unknown target id is passed.
Currently clang diagnoses the following code:
(https://godbolt.org/z/s8zK3E5P5) but nvcc
does not.

`
struct A {
   constexpr A(){}
};

struct  B {
  A a;
  int b;
};

template<typename T>
__global__ void kernel( )
{
   __shared__ B x;
}
`

Clang generates an implicit trivial ctor for struct B, which should be
allowed for initializing a shared variable.

However, the body of the ctor is defined only if the template kernel is
instantiated. Clang checks the initialization of variable in
non-instantiated templates, where it cannot find the body of the ctor,
therefore diagnoses it.

This patch skips the check for non-instantiated templates.
As described in: ARM-software/acle#257

Patch by : David Sherwood <[email protected]>

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D150961
…g-info

...behind an experimental CMAKE option that's off by default.

This patch adds a new ilist-iterator-like class that can carry two extra bits
as well as the usual node pointer. This is part of the project to remove
debug-intrinsics from LLVM: see the rationale here [0], they're needed to
signal whether a "position" in a BasicBlock includes any debug-info before or
after the iterator.

This entirely duplicates ilist_iterator, attempting re-use showed it to be a
false economy. It's enable-able through the existing ilist_node options
interface, hence a few sites where the instruction-list type needs to be
updated. The actual main feature, the extra bits in the class, aren't part of
the class unless the cmake flag is given: this is because there's a
compile-time cost associated with it, and I'd like to get everything in-tree
but off-by-default so that we can do proper comparisons.

Nothing actually makes use of this yet, but will do soon, see the Phab patch
stack.

[0] https://discourse.llvm.org/t/rfc-instruction-api-changes-needed-to-eliminate-debug-intrinsics-from-ir/68939

Differential Revision: https://reviews.llvm.org/D153777
Fixes build failures for cases where there's no additional visibility / linkage spec.

Differential Revision: https://reviews.llvm.org/D155850
Modify ConstantBounds' methods that handle subscripts and bounds to
avoid integer overflows. This is needed to properly handle arrays
with the maximum possible upper bound (INT64_MAX).
This updates the documentation on these inline asm constraint codes to
match reality. Context:
llvm/llvm-project#68818 (comment)

Note: dropping also the `'o'` from the docs because I can't find any
mention of it in X86ISelLowering.cpp.
Similar to FP16 but we don't have native scalar instruction support, so
limit it to vector types only.

Fixes #68149
…str (#68908)

The svldr_vnum_za and svstr_vnum_za builtins/intrinsics currently
require that the vnum argument be an immediate, but since vnum is used
to modify the base register via a mul and add, that restriction is not
necessary. This patch removes that restriction.
…67727)

Ensure that the f18Addendum flag is preserved in AllocatableApplyMold(),
that raw().type is reinitialized in AllocatableDeallocatePolymorphic(),
and that the implementations of SameTypeAs() and ExtendsTypeOf() handle
unallocated unlimited polymorphic arguments correctly.
…69129)

Mark any registers as CustomReg and any stack slot as CustomMem.
    
This allows us to more directly emit the register or memory access for
the high part. Previously we needed a memory access if the low register
was X17 and we assumed the stack offset was 0. If the low part wasn't
X17, we assumed the high register was the next register after the low
register.
    
This is another part of supporting FP arguments with GISel.
…plicit N1/N2 ops

First step towards adding unary/ternary fp ops handling, and not just binops
nikic and others added 28 commits October 19, 2023 09:08
…314)

replaceValuesPerBlockEntry() only handled simple and coerced load
values, however the load may also be referenced by a select value.

Additionally, I suspect that the previous code might have been incorrect
if a load had an offset, as it always constructed the AvailableValue
from scratch.

Fixes llvm/llvm-project#69301.
The keyword is intended for debugging purpose. It prints a message to
stderr.

This patch is based on code originally written by Adam Nemet, and on the
feedback received by the reviewers in
https://reviews.llvm.org/D157492.
This patch fixes:

  compiler-rt/lib/builtins/cpu_model.c:590:5: error: unannotated
  fall-through between switch labels [-Werror,-Wimplicit-fallthrough]

by adding a missing "break;".
…#68897)

* `dump`, added in llvm/llvm-project#68793
* `!repr`, added in llvm/llvm-project#68716

The keyword `assert` was missing, so I have added that too.
…d x, c)) with Zicond. (#69563)

It's only beneficial when cond is setcc with integer equality condition
code. For other case, it has same instruction count as the original.
This PR adds `nvvm.stmatrix` Op to NVVM dialect. The Op collectively
store one or more matrices across all threads in a warp to the given
address location in shared memory.
…tor (#69010)

Adds a new `__builtin_vectorelements()` function which returns the
number of elements for a given vector either at compile-time for
fixed-sized vectors, e.g., created via `__attribute__((vector_size(N)))`
or at runtime via a call to `@llvm.vscale.i32()` for scalable vectors,
e.g., SVE or RISCV V.

The new builtin follows a similar path as `sizeof()`, as it essentially
does the same thing but for the number of elements in vector instead of
the number of bytes. This allows us to re-use a lot of the existing
logic to handle types etc.

A small side addition is `Type::isSizelessVectorType()`, which we need
to distinguish between sizeless vectors (SVE, RISCV V) and sizeless
types (WASM).

This is the [corresponding
discussion](https://discourse.llvm.org/t/new-builtin-function-to-get-number-of-lanes-in-simd-vectors/73911).
…s. (#69329)

A recent commit (#69190) broke the bazel builds. Turns out that Bazel
uses symlinks for providing the test files, which the path expansion of
the module loading mechanism did not handle correctly. This PR fixes
that.

It also reorganizes the tests better: It puts all `.mlir` files that are
included by some other test into a common `include` folder. This greatly
simplifies the definition of the dependencies between the different
`.mlir` files in Bazel's `BUILD` file. The commit also adds a comment to
all included files why these aren't tested themselves direclty and uses
the `%{fs-sep}` expansion for paths more consistently. Finally, it
uncomments all but one of the tests excluded in Bazel because they seem
to run now. (The remaining one includes a file that it itself a test, so
it would have to live *in* and *outside* of the `include` folder.)
…68962)

The _mm_cmpistri instruction can be used to quickly parse identifiers.

With this patch activated, clang pre-processes <iostream> 1.8% faster,
and sqlite3.c amalgametion 1.5% faster, based on time measurements and
number of executed instructions as measured by valgrind.

The introduction of an extra helper function in the regular case has no
impact on performance, see


https://llvm-compile-time-tracker.com/compare.php?from=30240e428f0ec7d4a6d1b84f9f807ce12b46cfd1&to=12bcb016cde4579ca7b75397762098c03eb4f264&stat=instructions:u

---------

Co-authored-by: serge-sans-paille <[email protected]>
…ize tests. (#69329)"

This reverts commit f681225. That
commit changed the organization of the tests of the transform dialect
interpreter but did not take into account some tests that were added in
the meantime.
As described in: ARM-software/acle#257

Patch by : Sander de Smalen<[email protected]>

Reviewed By: dtemirbulatov

Differential Revision: https://reviews.llvm.org/D151199
…nize tests. (#69329)"

This reverts commit c122b97 but fixes
tests that were added between submitting #69329 for review and landing
it for the first time.
Building helloworld.c currently errors with "undefined symbol:
__llvm_libc_syscall"

See: llvm/llvm-project#67032
…variables"

This reverts commit 3353f7d.

Fixed test bug (unspecified order of arg evaluation)
…fault_mem_order REQUIRES clause

This patch creates the `OmpRewriteMutator` pass that runs at the end of
`RewriteParseTree()`. This pass is intended to make OpenMP-specific mutations
to the PFT after name resolution.

In the case of the `atomic_default_mem_order` clause of the REQUIRES directive,
name resolution results in populating global symbols with information about the
REQUIRES clauses that apply to that scope. The new rewrite pass is then able to
use this information in order to explicitly set the memory order of ATOMIC
constructs for which that is not already specified.

Given that this rewrite happens before semantics checks, the check of the order
in which ATOMIC constructs without explicit memory order and REQUIRES
directives with `atomic_default_mem_order` appear is moved earlier into the
rewrite pass. Otherwise, these problems would not be caught by semantics
checks, since the PFT would be modified by that stage.

This is patch 4/5 of a series splitting D149337 to simplify review.

Depends on D157983.

Differential Revision: https://reviews.llvm.org/D158096
To fix that ticket we only needed to address the V_LSHLREV_B16 case, but
I did it for all insts just in case.

Fixes #66899
Small fix for failing tests after merge of #69010. The tests need
`REQUIRES` to ensure that the correct headers are available. I've also
added a generic x86 build which does not need headers, so there is at
least one run per test.
When the FPU was selected with "+(no)fp(.dp)" extensions in "-march" or
"-mcpu" options, the FPU used for multilib selection was still the
default one for given architecture or CPU.
We already save the information about signedness ourselves.
As described in: ARM-software/acle#257

Patch by : David Sherwood <[email protected]>

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D151307
Similar to what we already do for add/sub + saturation variants.

Scalar support will be added in a future patch covering the other variants at the same time.

Alive2: https://alive2.llvm.org/ce/z/rBDrNE

Fixes #69080
…9539)

The immediate legality checks are now embedded into the
isOperandLegal(). It is not needed to check it again.
@kaz7 kaz7 merged commit bbf4055 into develop Jan 24, 2024
3 of 4 checks passed
@kaz7 kaz7 deleted the feature/merge-upstream-20231019 branch January 24, 2024 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment