Add GitHub CI #1

Merged
merged 1 commit into from
Sep 10, 2023

Conversation

cooljeanius
Owner

From @talregev

cooljeanius merged commit e6f92c6 into cooljeanius:me/CI on Sep 10, 2023
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
This patch removes one machine instruction from the "single bit extraction
with shifting" operation, and tries to eliminate the conditional
branch if CST2_POW2 doesn't fit into signed 12 bits, with the help
of the ifcvt optimization.

    /* example #1 */
    int test0(int x) {
      return (x & 1048576) != 0 ? 1024 : 0;
    }
    extern int foo(void);
    int test1(void) {
      return (foo() & 1048576) != 0 ? 16777216 : 0;
    }

    ;; before
    test0:
	movi	a9, 0x400
	srai	a2, a2, 10
	and	a2, a2, a9
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	extui	a2, a2, 20, 1
	slli	a2, a2, 20
	beqz.n	a2, .L2
	movi.n	a2, 1
	slli	a2, a2, 24
    .L2:
	l32i.n	a0, sp, 12
	addi	sp, sp, 16
	ret.n

    ;; after
    test0:
	extui	a2, a2, 20, 1
	slli	a2, a2, 10
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	l32i.n	a0, sp, 12
	extui	a2, a2, 20, 1
	slli	a2, a2, 24
	addi	sp, sp, 16
	ret.n

In addition, if the left shift amount ('exact_log2(CST2_POW2)') is
between 1 and 3 and either an addition or a subtraction with another
register follows, emit an ADDX[248] or SUBX[248] machine instruction
instead of separate left shift and add/subtract ones.

    /* example #2 */
    int test2(int x, int y) {
      return ((x & 1048576) != 0 ? 4 : 0) + y;
    }
    int test3(int x, int y) {
      return ((x & 2) != 0 ? 8 : 0) - y;
    }

    ;; before
    test2:
	movi.n	a9, 4
	srai	a2, a2, 18
	and	a2, a2, a9
	add.n	a2, a2, a3
	ret.n
    test3:
	movi.n	a9, 8
	slli	a2, a2, 2
	and	a2, a2, a9
	sub	a2, a2, a3
	ret.n

    ;; after
    test2:
	extui	a2, a2, 20, 1
	addx4	a2, a2, a3
	ret.n
    test3:
	extui	a2, a2, 1, 1
	subx8	a2, a2, a3
	ret.n

gcc/ChangeLog:

	* config/xtensa/predicates.md (addsub_operator): New.
	* config/xtensa/xtensa.md (*extzvsi-1bit_ashlsi3,
	*extzvsi-1bit_addsubx): New insn_and_split patterns.
	* config/xtensa/xtensa.cc (xtensa_rtx_costs):
	Add a special case about ifcvt 'noce_try_cmove()' to handle
	constant loads that do not fit into signed 12 bits in the
	patterns added above.
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
In plenty of image and video processing code it's common to modify pixel values
by a widening operation and then scale them back into range by dividing by 255.

This patch adds a named function to allow us to emit an optimized sequence
when doing an unsigned division that is equivalent to:

   x = y / (2 ^ (bitsize (y)/2)-1)

For SVE2 this means we generate for:

void draw_bitmap1(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

the following:

        mov     z3.b, #1
.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

instead of:

.L3:
        ld1b    z0.h, p1/z, [x0, x3]
        mul     z0.h, p0/m, z0.h, z1.h
        umulh   z0.h, p0/m, z0.h, z2.h
        lsr     z0.h, z0.h, #7
        st1b    z0.h, p1, [x0, x3]
        inch    x3
        whilelo p1.h, w3, w2
        b.any   .L3

This results in significantly faster code.

gcc/ChangeLog:

	* config/aarch64/aarch64-sve2.md (@aarch64_bitmask_udiv<mode>3): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve2/div-by-bitmask_1.c: New test.
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
A friend declaration can only have constraints if it is defined.  If
multiple instantiations of a class template define the same friend function
signature, it's an error, but that shouldn't happen if it's constrained to
only be declared in one instantiation.

Currently we don't mangle requirements, so the foos all mangle the same and
actually instantiating #1 will break, but for now we can test that they're
considered distinct.
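
A minimal, hypothetical sketch of the scenario (not the actual
concepts-friend11.C test): each instantiation defines a friend foo with the
same signature, but the constraint depends on T, so the foos coming from
A<char> and A<int> are considered distinct functions.

  template<class T>
  struct A {
    friend void foo() requires (sizeof(T) == 1) { }  // constrained friend definition
  };

  A<char> a;
  A<int>  b;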

gcc/cp/ChangeLog:

	* pt.cc (tsubst_friend_function): Check satisfaction.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/concepts-friend11.C: New test.
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
To improve compile times, the C++ library could use compiler built-ins
rather than implementing std::is_convertible (and _nothrow) as class
templates.  This patch adds the built-ins.  We already have
__is_constructible and __is_assignable, and the nothrow forms of those.

Microsoft (and clang, for compatibility) also provide an alias called
__is_convertible_to.  I did not add it, but it would be trivial to do
so.
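
For reference, a short sketch of how the new built-ins are spelled (same
argument order as std::is_convertible<From, To>):

  static_assert(__is_convertible(int, double), "");
  static_assert(!__is_convertible(int*, double*), "");
  static_assert(__is_nothrow_convertible(int, long), "");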

I noticed that our __is_assignable doesn't implement the "Access checks
are performed as if from a context unrelated to either type" requirement;
therefore std::is_assignable and __is_assignable give two different results
here:

  class S {
    operator int();
    friend void g(); // #1
  };

  void
  g ()
  {
    // #1 doesn't matter
    static_assert(std::is_assignable<int&, S>::value, "");
    static_assert(__is_assignable(int&, S), "");
  }

This is not a problem if __is_assignable is not meant to be used by
the users.

This patch doesn't make libstdc++ use the new built-ins, but I had to
rename a class otherwise its name would clash with the new built-in.

	PR c++/106784

gcc/c-family/ChangeLog:

	* c-common.cc (c_common_reswords): Add __is_convertible and
	__is_nothrow_convertible.
	* c-common.h (enum rid): Add RID_IS_CONVERTIBLE and
	RID_IS_NOTHROW_CONVERTIBLE.

gcc/cp/ChangeLog:

	* constraint.cc (diagnose_trait_expr): Handle CPTK_IS_CONVERTIBLE
	and CPTK_IS_NOTHROW_CONVERTIBLE.
	* cp-objcp-common.cc (names_builtin_p): Handle RID_IS_CONVERTIBLE
	and RID_IS_NOTHROW_CONVERTIBLE.
	* cp-tree.h (enum cp_trait_kind): Add CPTK_IS_CONVERTIBLE and
	CPTK_IS_NOTHROW_CONVERTIBLE.
	(is_convertible): Declare.
	(is_nothrow_convertible): Likewise.
	* cxx-pretty-print.cc (pp_cxx_trait_expression): Handle
	CPTK_IS_CONVERTIBLE and CPTK_IS_NOTHROW_CONVERTIBLE.
	* method.cc (is_convertible): New.
	(is_nothrow_convertible): Likewise.
	* parser.cc (cp_parser_primary_expression): Handle RID_IS_CONVERTIBLE
	and RID_IS_NOTHROW_CONVERTIBLE.
	(cp_parser_trait_expr): Likewise.
	* semantics.cc (trait_expr_value): Handle CPTK_IS_CONVERTIBLE and
	CPTK_IS_NOTHROW_CONVERTIBLE.
	(finish_trait_expr): Likewise.

libstdc++-v3/ChangeLog:

	* include/std/type_traits: Rename __is_nothrow_convertible to
	__is_nothrow_convertible_lib.
	* testsuite/20_util/is_nothrow_convertible/value_ext.cc: Likewise.

gcc/testsuite/ChangeLog:

	* g++.dg/ext/has-builtin-1.C: Enhance to test __is_convertible and
	__is_nothrow_convertible.
	* g++.dg/ext/is_convertible1.C: New test.
	* g++.dg/ext/is_convertible2.C: New test.
	* g++.dg/ext/is_nothrow_convertible1.C: New test.
	* g++.dg/ext/is_nothrow_convertible2.C: New test.
cooljeanius pushed a commit that referenced this pull request Oct 28, 2023
This patch is my proposed solution to PR rtl-optimization/91865.
Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND
to a single ZERO_EXTEND, but as shown in this PR it is possible for
combine's make_compound_operation to unintentionally generate a
non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be
matched by the backend.

For the new test case:

const int table[2] = {1, 2};
int foo (char i) { return table[i]; }

compiling with -O2 -mlarge on msp430 we currently see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Failed to match this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ]))))

which results in the following code:

foo:	AND     #0xff, R12
        RLAM.A #4, R12 { RRAM.A #4, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

With this patch, we now see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Successfully matched this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ])))
allowing combination of insns 2 and 7
original costs 4 + 8 = 12
replacement cost 8

foo:	MOV.B   R12, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

2023-10-26  Roger Sayle  <[email protected]>
	    Richard Biener  <[email protected]>

gcc/ChangeLog
	PR rtl-optimization/91865
	* combine.cc (make_compound_operation): Avoid creating a
	ZERO_EXTEND of a ZERO_EXTEND.

gcc/testsuite/ChangeLog
	PR rtl-optimization/91865
	* gcc.target/msp430/pr91865.c: New test case.
cooljeanius pushed a commit that referenced this pull request Dec 18, 2023
Since the last import from upstream libsanitizer, the output has changed
and now looks more like this:

READ of size 6 at 0x7ff7beb2a144 thread T0
    #0 0x101cf7796 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) sanitizer_common_interceptors.inc:813
    #1 0x101cf7b99 in memcmp sanitizer_common_interceptors.inc:840
    #2 0x108a0c39f in __stack_chk_guard+0xf (dyld:x86_64+0x8039f)

so let's adjust the pattern accordingly.

gcc/testsuite/ChangeLog:

	* c-c++-common/asan/memcmp-1.c: Adjust pattern on darwin.
cooljeanius pushed a commit that referenced this pull request Dec 23, 2023
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // #1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than #1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // #1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider #1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
cooljeanius pushed a commit that referenced this pull request Jan 16, 2024
Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab;
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
when the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed, which enables vectorization of many
other loop forms.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed-size iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes. N must also be known.  This limitation is because our primary target
  for this optimization is SVE.  For VLA SVE we can't easily do cross page
  iteration checks.  The result is also likely not to be beneficial.  For that
  reason we punt support for variable buffers till we have First-Faulting
  support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated from the store, or the store itself can't be moved.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Peeling code supports it but the code motion code cannot
  find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.

With the help of IPA and LTO this still gets hit quite often.  During bootstrap
it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depends on peeling and iteration start and which
exit was taken (natural or early).   For this scalar loop, all early exits are
treated the same.

When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break.  If this is not possible we decline to vectorize.  The
analysis and code motion also takes into account that it doesn't introduce a RAW
dependency after the move of the stores.

This means that we check at the start of iterations whether we are going to exit
or not.  During the analysis phase we check whether we are allowed to do this
moving of statements.  Also note that we only move the scalar statements, and
only do so after peeling, just before we start transforming statements.

With this the vector flow no longer necessarily needs to match that of the
scalar code.  In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.

Codegen:

for e.g.

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15 as for majority of the cases build
   SLP itself fails.  This means I'll need to spend time in making this more
   robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over
various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (idx_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count and instead look at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
cooljeanius pushed a commit that referenced this pull request Feb 5, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2, and smaller but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <[email protected]>
cooljeanius pushed a commit that referenced this pull request Apr 7, 2024
We evaluate constexpr functions on the original, pre-genericization bodies.
That means that the function body we're evaluating will not have gone
through cp_genericize_r's "Map block scope extern declarations to visible
declarations with the same name and type in outer scopes if any".  Here:

  constexpr bool bar() { return true; } // #1
  constexpr bool foo() {
    constexpr bool bar(void); // #2
    return bar();
  }

it means that we:
1) register_constexpr_fundef (#1)
2) cp_genericize (#1)
   nothing interesting happens
3) register_constexpr_fundef (foo)
   does copy_fn, so we have two copies of the BIND_EXPR
4) cp_genericize (foo)
   this remaps #2 to #1, but only on one copy of the BIND_EXPR
5) retrieve_constexpr_fundef (foo)
   we find it, no problem
6) retrieve_constexpr_fundef (#2)
   and here #2 isn't found in constexpr_fundef_table, because
   we're working on the BIND_EXPR copy where #2 wasn't mapped to #1
   so we fail.  We've only registered #1.

It should work to use DECL_LOCAL_DECL_ALIAS (which used to be
extern_decl_map).  We evaluate constexpr functions on pre-cp_fold
bodies to avoid diagnostic problems, but the remapping I'm proposing
should not interfere with diagnostics.

This is not a problem for a global scope redeclaration; there we go
through duplicate_decls which keeps the DECL_UID:
  DECL_UID (olddecl) = olddecl_uid;
and DECL_UID is what constexpr_fundef_hasher::hash uses.

	PR c++/111132

gcc/cp/ChangeLog:

	* constexpr.cc (get_function_named_in_call): Use
	cp_get_fndecl_from_callee.
	* cvt.cc (cp_get_fndecl_from_callee): If there's a
	DECL_LOCAL_DECL_ALIAS, use it.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/constexpr-redeclaration3.C: New test.
	* g++.dg/cpp0x/constexpr-redeclaration4.C: New test.
cooljeanius pushed a commit that referenced this pull request Apr 7, 2024
aarch64-sve.md had a pattern that combined:

	cmpeq	pb.T, pa/z, zc.T, #0
	mov	zd.T, pb/z, #1

into:

	cnot	zd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.
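
For reference, a hedged ACLE-level sketch (assuming <arm_sve.h>) of an
affected _x operation: after this fix, with a general (possibly non-ptrue)
predicate it is represented as a fully-defined _m operation instead of being
folded to a single CNOT.

  #include <arm_sve.h>

  svint32_t
  f (svbool_t pg, svint32_t x)
  {
    return svcnot_s32_x (pg, x);
  }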

gcc/
	PR target/114603
	* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot<mode>): Replace
	with...
	(@aarch64_ptrue_cnot<mode>): ...this, requiring operand 1 to be
	a ptrue.
	(*cnot<mode>): Require operand 1 to be a ptrue.
	* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
	Use aarch64_ptrue_cnot<mode> for _x operations that are predicated
	with a ptrue.  Represent other _x operations as fully-defined _m
	operations.

gcc/testsuite/
	PR target/114603
	* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.
cooljeanius pushed a commit that referenced this pull request Apr 22, 2024
…. [PR114741]

In PR114741 we see that we have a regression in codegen when SVE is enabled, where
the simple testcase:

void foo(unsigned v, unsigned *p)
{
    *p = v & 1;
}

generates

foo:
        fmov    s31, w0
        and     z31.s, z31.s, #1
        str     s31, [x1]
        ret

instead of:

foo:
        and     w0, w0, 1
        str     w0, [x1]
        ret

This impacts not just codesize but also performance.  The regression is caused
by the use of the ^ constraint modifier in the pattern <optab><mode>3.

The documentation states that this modifier should only have an effect on the
alternative costing in that a particular alternative is to be preferred unless
a non-pseudo reload is needed.

The pattern was trying to convey that whenever both r and w are required,
it should prefer r unless a reload is needed.  This is because if a reload is
needed then we can construct the constants more flexibly on the SIMD side.

We were using this to simplify the implementation and to get generic cases such
as:

double negabs (double x)
{
   unsigned long long y;
   memcpy (&y, &x, sizeof(double));
   y = y | (1UL << 63);
   memcpy (&x, &y, sizeof(double));
   return x;
}

which don't go through an expander.
However, the implementation of ^ in the register allocator does not match
the documentation, in that it also has an effect during coloring.  During initial
register class selection it applies a penalty to a class, similar to how ? does.

In this example the penalty makes the use of GP regs expensive enough that it no
longer considers them:

    r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
;;        3--> b  0: i   9 r106=r105&0x1
    :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
                         PR_HI_REGS+0(0):model 4

which is not the expected behavior.  For GCC 14 this is a conservative fix.

1. we remove the ^ modifier from the logical optabs.

2. In order not to regress copysign we then move the copysign expansion to
   directly use the SIMD variant.  Since copysign only supports floating point
   modes this is fine and no longer relies on the register allocator to select
   the right alternative.

It once again regresses the general case, but this case wasn't optimized in
earlier GCCs either so it's not a regression in GCC 14.  This change gives
strictly better codegen than earlier GCCs and still optimizes the important cases.

gcc/ChangeLog:

	PR target/114741
	* config/aarch64/aarch64.md (<optab><mode>3): Remove ^ from alt 2.
	(copysign<GPF:mode>3): Use SIMD version of IOR directly.

gcc/testsuite/ChangeLog:

	PR target/114741
	* gcc.target/aarch64/fneg-abs_2.c: Update codegen.
	* gcc.target/aarch64/fneg-abs_4.c: xfail for now.
	* gcc.target/aarch64/pr114741.c: New test.
cooljeanius pushed a commit that referenced this pull request Jun 10, 2024
The PR complains that

  void do_something(){
    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wunused-label"
    start:;
    #pragma GCC diagnostic pop
  } #1

doesn't work.  That's because we call warn_for_unused_label only while we're
in finish_function, meaning we're at #1, where we're outside the #pragma
region.  We can use suppress_warning + warning_suppressed_p to fix this.

Note that I'm not using TREE_USED.  Propagating it in tsubst_stmt/LABEL_EXPR
from decl to label would mean that we don't warn in do_something2, but
I think we want the warning there: we're in a template and the goto is
a discarded statement.

	PR c++/113582

gcc/c-family/ChangeLog:

	* c-warn.cc (warn_for_unused_label): Don't warn if -Wunused-label has
	been suppressed for the label.

gcc/cp/ChangeLog:

	* parser.cc (cp_parser_label_for_labeled_statement): suppress_warning
	if it's not enabled at input_location.
	* pt.cc (tsubst_stmt): Call copy_warning.

gcc/testsuite/ChangeLog:

	* g++.dg/warn/Wunused-label-4.C: New test.
cooljeanius pushed a commit that referenced this pull request Jun 10, 2024
This patch fixes the trailing-operator format issues below.

=== ERROR type #1: trailing operator (4 error(s)) ===
gcc/config/riscv/riscv-vector-builtins.cc:4641:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_16) &&
gcc/config/riscv/riscv-vector-builtins.cc:4651:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_32) &&
gcc/config/riscv/riscv-vector-builtins.cc:4661:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_64) &&
gcc/config/riscv/riscv-vector-builtins.cc:4670:36:  if ((exts &
RVV_REQUIRE_ELEN_64) &&

The patch passes ./contrib/check_GNU_style.sh, and I double-checked
that there is no other format issue in the original patch.

Committed as format change.
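
A hedged, self-contained illustration of the style rule involved (the
condition and names below are stand-ins, not the real
riscv-vector-builtins.cc code):

  #define RVV_REQUIRE_ELEN_FP_16 (1u << 0)  /* stand-in value */

  static int
  ext_ok_p (unsigned exts, int have_fp16)
  {
    /* check_GNU_style.sh flags the trailing operator in
         if ((exts & RVV_REQUIRE_ELEN_FP_16) &&
             !have_fp16)
       so the operator is moved to the start of the continuation line:  */
    if ((exts & RVV_REQUIRE_ELEN_FP_16)
        && !have_fp16)
      return 0;
    return 1;
  }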

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins.cc
	(validate_instance_type_required_extensions): Remove the
	trailing operator and put it on a new line.

Signed-off-by: Pan Li <[email protected]>
cooljeanius pushed a commit that referenced this pull request Jul 7, 2024
Here during overload resolution we have two strictly viable ambiguous
candidates #1 and #2, and two non-strictly viable candidates #3 and #4
which we hold on to ever since r14-6522.  These latter candidates have
an empty second arg conversion since the first arg conversion was deemed
bad, and this trips up joust, which assumes all arg conversions are present,
when called on #3 and #4.

We can fix this by making joust robust to empty arg conversions, but in
this situation we shouldn't need to compare #3 and #4 at all given that
we have a strictly viable candidate.  To that end, this patch makes
tourney shortcut considering non-strictly viable candidates upon
encountering ambiguity between two strictly viable candidates (taking
advantage of the fact that the candidates list is sorted according to
viability via splice_viable).
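
A hypothetical sketch of the shape of the problem (not the actual PR 115239
testcase): #1 and #2 are strictly viable but ambiguous for the call, while
#3 and #4 are only non-strictly viable because binding the prvalue 0 to int&
is a bad conversion; with this patch tourney never gets to comparing #3 and #4.

  void f(int, long);    // #1 strictly viable
  void f(long, int);    // #2 strictly viable, ambiguous with #1 for f(0, 0)
  void f(int&, int);    // #3 non-strictly viable: bad first arg conversion
  void f(int&, long);   // #4 non-strictly viable
  void g() { f(0, 0); } // error: ambiguous, without ever jousting #3 and #4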

	PR c++/115239

gcc/cp/ChangeLog:

	* call.cc (tourney): Don't consider a non-strictly viable
	candidate as the champ if there was ambiguity between two
	strictly viable candidates.

gcc/testsuite/ChangeLog:

	* g++.dg/overload/error7.C: New test.

Reviewed-by: Jason Merrill <[email protected]>
cooljeanius pushed a commit that referenced this pull request Jul 7, 2024
This patch improves GCC’s vectorization of __builtin_popcount for the aarch64 target
by adding popcount patterns for vector modes besides QImode, i.e., HImode,
SImode and DImode.
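
A hedged sketch of the kind of source loop this enables (not one of the new
tests): a popcount over 16-bit elements that can be vectorized with the
cnt/uaddlp sequences shown below.

  void
  popcount_u16 (unsigned short *__restrict dst,
                unsigned short *__restrict src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = __builtin_popcount (src[i]);
  }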

With this patch, we now generate the following for V8HI:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b

For V4HI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b

For V4SI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h

For V4SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b

For V2SI, we generate:
  cnt     v1.8b, v.8b
  uaddlp  v2.4h, v1.8b
  uaddlp  v3.2s, v2.4h

For V2SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.8b, #0
  movi    v1.8b, #1
  cnt     v3.8b, v2.8b
  udot    v0.2s, v3.8b, v1.8b

For V2DI, we generate:
  cnt     v1.16b, v.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h
  uaddlp  v4.2d, v3.4s

For V2DI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b
  uaddlp  v0.2d, v0.4s

	PR target/113859

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_<su>addlp<mode>): Rename to...
	(@aarch64_<su>addlp<mode>): ... This.
	(popcount<mode>2): New define_expand.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/popcnt-udot.c: New test.
	* gcc.target/aarch64/popcnt-vec.c: New test.

Signed-off-by: Pengxuan Zheng <[email protected]>
cooljeanius pushed a commit that referenced this pull request Aug 13, 2024
On many cores, including Neoverse V2 the throughput of vector ADD
instructions is higher than vector shifts like SHL.  We can lean on that
to emit code like:
  add     v0.4s, v0.4s, v0.4s
instead of:
  shl     v0.4s, v0.4s, 1

LLVM already does this trick.
In RTL the code gets canonicalised from (plus x x) to (ashift x 1), so I
opted to instead do this at the final assembly printing stage, similar
to how we emit CMLT instead of SSHR elsewhere in the backend.

I'd like to also do this for SVE shifts, but those will have to be
separate patches.
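
A minimal sketch (GNU vector extensions; not an added testcase) of a shift
that can now be printed in the ADD form:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  shl1 (v4si x)
  {
    return x << 1;  /* canonicalised to (ashift x 1) in RTL */
  }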

Signed-off-by: Kyrylo Tkachov <[email protected]>

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(aarch64_simd_imm_shl<mode><vczle><vczbe>): Rewrite to new
	syntax.  Add =w,w,vs1 alternative.
	* config/aarch64/constraints.md (vs1): New constraint.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/advsimd_shl_add.c: New test.
cooljeanius pushed a commit that referenced this pull request Aug 28, 2024
This patch tweaks timode_scalar_chain::compute_convert_gain to better
reflect the expansion of V1TImode arithmetic right shifts by the i386
backend.  The comment "see ix86_expand_v1ti_ashiftrt" appears after
"case ASHIFTRT" in compute_convert_gain, and the changes below attempt
to better match the logic used there.

The original motivating example is:

__int128 m1;
void foo()
{
  m1 = (m1 << 8) >> 8;
}

which with -O2 -mavx2 we fail to convert to vector form due to the
inappropriate cost of the arithmetic right shift.

  Instruction gain -16 for     7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: -3
  Chain #1 conversion is not profitable

This is reporting that the ASHIFTRT is four instructions worse using
vectors than in scalar form, which is incorrect as the AVX2 expansion
of this shift only requires three instructions (and the scalar form
requires two).

With more accurate costs in timode_scalar_chain::compute_convert_gain
we now see (with -O2 -mavx2):

  Instruction gain -4 for     7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: 9
  Converting chain #1...

which results in:

foo:	vmovdqa m1(%rip), %xmm0
        vpslldq $1, %xmm0, %xmm0
        vpsrad  $8, %xmm0, %xmm1
        vpsrldq $1, %xmm0, %xmm0
        vpblendd        $7, %xmm0, %xmm1, %xmm0
        vmovdqa %xmm0, m1(%rip)
        ret

2024-08-25  Roger Sayle  <[email protected]>
	    Uros Bizjak  <[email protected]>

gcc/ChangeLog
	* config/i386/i386-features.cc (compute_convert_gain)
	<case ASHIFTRT>: Update to match ix86_expand_v1ti_ashiftrt.
cooljeanius pushed a commit that referenced this pull request Sep 20, 2024
…o_debug_section [PR116614]

cat abc.C
  #define A(n) struct T##n {} t##n;
  #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
  #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
  #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
  #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
  E(1) E(2) E(3)
  int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The following patch fixes that.  Most of the 64K+ section support for
reading and writing was already there years ago (and especially reading was
used quite often already), and a further bug in it was fixed by the PR104617 fix.

Yet, the fix isn't solely about removing the
  if (new_i - 1 >= SHN_LORESERVE)
    {
      *err = ENOTSUP;
      return "Too many copied sections";
    }
5 lines; the missing part was that the function only handled reading of
the .symtab_shndx section but not copying/updating of it.
If the result has less than 64K-epsilon sections, that actually wasn't
needed, but e.g. with -fdebug-types-section one can exceed that pretty
easily (reported to us on WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it basically
needs to be done in lock step with updating the .symtab section: if one
doesn't need to use SHN_XINDEX in there, the section should (or should be
updated to) contain an SHN_UNDEF entry, otherwise it needs to have whatever
would otherwise be stored but couldn't fit.  But repeating all the symtab
decisions about what to discard and how to rewrite it just for that would be ugly.

So, the patch instead emits the .symtab_shndx section (or sections) last
and prepares the content during the .symtab processing and in a second
pass when going just through .symtab_shndx sections just uses the saved
content.

2024-09-07  Jakub Jelinek  <[email protected]>

	PR lto/116614
	* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
	comments.
	(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
	consistency.
	(simple_object_elf_find_sections): Formatting fixes.
	(simple_object_elf_fetch_attributes): Likewise.
	(simple_object_elf_attributes_merge): Likewise.
	(simple_object_elf_start_write): Likewise.
	(simple_object_elf_write_ehdr): Likewise.
	(simple_object_elf_write_shdr): Likewise.
	(simple_object_elf_write_to_file): Likewise.
	(simple_object_elf_copy_lto_debug_section): Likewise.  Don't fail for
	new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
	over .symtab_shndx sections, though emit those last and compute their
	section content when processing associated .symtab sections.  Handle
	simple_object_internal_read failure even in the .symtab_shndx reading
	case.
cooljeanius pushed a commit that referenced this pull request Sep 20, 2024
On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
instructions like SHL have a throughput of 2. We can lean on that to emit code
like:
 add	z31.b, z31.b, z31.b
instead of:
 lsl	z31.b, z31.b, #1

The implementation of this change for SVE vectors is similar to a prior patch
<https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds
the above functionality for Neon vectors.

Here, the machine descriptor pattern is split up to separately accommodate left
and right shifts, so we can specifically emit an add for all left shifts by 1.
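
A hedged ACLE-level sketch (assuming <arm_sve.h>; not an added testcase) of a
left shift by 1 that can now be emitted in the ADD form:

  #include <arm_sve.h>

  svuint8_t
  shl1 (svbool_t pg, svuint8_t x)
  {
    return svlsl_n_u8_x (pg, x, 1);
  }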

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR <[email protected]>

gcc/ChangeLog:

	* config/aarch64/aarch64-sve.md (*post_ra_v<optab><mode>3): Split pattern
	to accommodate left and right shifts separately.
	(*post_ra_v_ashl<mode>3): Matches left shifts with additional
	constraint to check for shifts by 1.
	(*post_ra_v_<optab><mode>3): Matches right shifts.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1
	with corresponding add.
	* gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
	* gcc.target/aarch64/sve/adr_1.c: Likewise.
	* gcc.target/aarch64/sve/adr_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
	* gcc.target/aarch64/sve/shift_2.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
	* gcc.target/aarch64/sve/sve_shl_add.c: New test.
cooljeanius pushed a commit that referenced this pull request Sep 20, 2024
…on [PR113328]

SVE's INDEX instruction can be used to populate vectors with values starting from
"base" and incremented by "step" for each subsequent value. We can take
advantage of it to generate vector constants if TARGET_SVE is available and the
base and step values are within [-16, 15].

For example, with the following function:

typedef int v4si __attribute__ ((vector_size (16)));
v4si
f_v4si (void)
{
  return (v4si){ 0, 1, 2, 3 };
}

GCC currently generates:

f_v4si:
	adrp    x0, .LC4
	ldr     q0, [x0, #:lo12:.LC4]
	ret

.LC4:
	.word   0
	.word   1
	.word   2
	.word   3

With this patch, we generate an INDEX instruction instead if TARGET_SVE is
available.

f_v4si:
	index   z0.s, #0, #1
	ret
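
As another hedged example of the stated [-16, 15] rule (reusing the v4si
typedef above; not an added testcase), a strided constant can likewise become
a single INDEX:

v4si
g_v4si (void)
{
  return (v4si){ 1, 3, 5, 7 };  /* expected: index z0.s, #1, #2 */
}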

	PR target/113328

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_simd_valid_immediate): Improve
	handling of some ADVSIMD vectors by using SVE's INDEX if TARGET_SVE is
	available.
	(aarch64_output_simd_mov_immediate): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/general/dupq_1.c: Update test to use
	SVE's INDEX instruction.
	* gcc.target/aarch64/sve/acle/general/dupq_2.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_3.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise.
	* gcc.target/aarch64/sve/vec_init_3.c: New test.

Signed-off-by: Pengxuan Zheng <[email protected]>
cooljeanius pushed a commit that referenced this pull request Oct 21, 2024
Implement vddup and vidup using the new MVE builtins framework.

We generate better code because we take advantage of the two outputs
produced by the v[id]dup instructions.

For instance, before:
	ldr	r3, [r0]
	sub	r2, r3, #8
	str	r2, [r0]
	mov	r2, r3
	vddup.u16	q3, r2, #1

now:
	ldr	r2, [r0]
	vddup.u16	q3, r2, #1
	str	r2, [r0]
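
A hedged sketch of a source-level use that produces the shorter sequence
above (the intrinsic spelling follows ACLE MVE and arm_mve.h; treat the exact
signature as an assumption, it is not part of this patch):

  #include <arm_mve.h>

  uint16x8_t
  f (uint32_t *p)
  {
    /* Writeback form: the vector result and the updated *p are the two
       outputs that the new framework now uses directly.  */
    return vddupq_wb_u16 (p, 1);
  }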

2024-08-21  Christophe Lyon  <[email protected]>

	gcc/
	* config/arm/arm-mve-builtins-base.cc (class viddup_impl): New.
	(vddup): New.
	(vidup): New.
	* config/arm/arm-mve-builtins-base.def (vddupq): New.
	(vidupq): New.
	* config/arm/arm-mve-builtins-base.h (vddupq): New.
	(vidupq): New.
	* config/arm/arm_mve.h (vddupq_m): Delete.
	(vddupq_u8): Delete.
	(vddupq_u32): Delete.
	(vddupq_u16): Delete.
	(vidupq_m): Delete.
	(vidupq_u8): Delete.
	(vidupq_u32): Delete.
	(vidupq_u16): Delete.
	(vddupq_x_u8): Delete.
	(vddupq_x_u16): Delete.
	(vddupq_x_u32): Delete.
	(vidupq_x_u8): Delete.
	(vidupq_x_u16): Delete.
	(vidupq_x_u32): Delete.
	(vddupq_m_n_u8): Delete.
	(vddupq_m_n_u32): Delete.
	(vddupq_m_n_u16): Delete.
	(vddupq_m_wb_u8): Delete.
	(vddupq_m_wb_u16): Delete.
	(vddupq_m_wb_u32): Delete.
	(vddupq_n_u8): Delete.
	(vddupq_n_u32): Delete.
	(vddupq_n_u16): Delete.
	(vddupq_wb_u8): Delete.
	(vddupq_wb_u16): Delete.
	(vddupq_wb_u32): Delete.
	(vidupq_m_n_u8): Delete.
	(vidupq_m_n_u32): Delete.
	(vidupq_m_n_u16): Delete.
	(vidupq_m_wb_u8): Delete.
	(vidupq_m_wb_u16): Delete.
	(vidupq_m_wb_u32): Delete.
	(vidupq_n_u8): Delete.
	(vidupq_n_u32): Delete.
	(vidupq_n_u16): Delete.
	(vidupq_wb_u8): Delete.
	(vidupq_wb_u16): Delete.
	(vidupq_wb_u32): Delete.
	(vddupq_x_n_u8): Delete.
	(vddupq_x_n_u16): Delete.
	(vddupq_x_n_u32): Delete.
	(vddupq_x_wb_u8): Delete.
	(vddupq_x_wb_u16): Delete.
	(vddupq_x_wb_u32): Delete.
	(vidupq_x_n_u8): Delete.
	(vidupq_x_n_u16): Delete.
	(vidupq_x_n_u32): Delete.
	(vidupq_x_wb_u8): Delete.
	(vidupq_x_wb_u16): Delete.
	(vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m_n_u8): Delete.
	(__arm_vddupq_m_n_u32): Delete.
	(__arm_vddupq_m_n_u16): Delete.
	(__arm_vddupq_m_wb_u8): Delete.
	(__arm_vddupq_m_wb_u16): Delete.
	(__arm_vddupq_m_wb_u32): Delete.
	(__arm_vddupq_n_u8): Delete.
	(__arm_vddupq_n_u32): Delete.
	(__arm_vddupq_n_u16): Delete.
	(__arm_vidupq_m_n_u8): Delete.
	(__arm_vidupq_m_n_u32): Delete.
	(__arm_vidupq_m_n_u16): Delete.
	(__arm_vidupq_n_u8): Delete.
	(__arm_vidupq_m_wb_u8): Delete.
	(__arm_vidupq_m_wb_u16): Delete.
	(__arm_vidupq_m_wb_u32): Delete.
	(__arm_vidupq_n_u32): Delete.
	(__arm_vidupq_n_u16): Delete.
	(__arm_vidupq_wb_u8): Delete.
	(__arm_vidupq_wb_u16): Delete.
	(__arm_vidupq_wb_u32): Delete.
	(__arm_vddupq_wb_u8): Delete.
	(__arm_vddupq_wb_u16): Delete.
	(__arm_vddupq_wb_u32): Delete.
	(__arm_vddupq_x_n_u8): Delete.
	(__arm_vddupq_x_n_u16): Delete.
	(__arm_vddupq_x_n_u32): Delete.
	(__arm_vddupq_x_wb_u8): Delete.
	(__arm_vddupq_x_wb_u16): Delete.
	(__arm_vddupq_x_wb_u32): Delete.
	(__arm_vidupq_x_n_u8): Delete.
	(__arm_vidupq_x_n_u16): Delete.
	(__arm_vidupq_x_n_u32): Delete.
	(__arm_vidupq_x_wb_u8): Delete.
	(__arm_vidupq_x_wb_u16): Delete.
	(__arm_vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m): Delete.
	(__arm_vddupq_u8): Delete.
	(__arm_vddupq_u32): Delete.
	(__arm_vddupq_u16): Delete.
	(__arm_vidupq_m): Delete.
	(__arm_vidupq_u8): Delete.
	(__arm_vidupq_u32): Delete.
	(__arm_vidupq_u16): Delete.
	(__arm_vddupq_x_u8): Delete.
	(__arm_vddupq_x_u16): Delete.
	(__arm_vddupq_x_u32): Delete.
	(__arm_vidupq_x_u8): Delete.
	(__arm_vidupq_x_u16): Delete.
	(__arm_vidupq_x_u32): Delete.
cooljeanius pushed a commit that referenced this pull request Oct 24, 2024
gcc.dg/torture/pr112305.c contains an inner loop that executes
0x8000_0014 times and an outer loop that executes 5 times, giving about
10 billion total executions of the inner loop body.  At -O2 and above we
are able to remove the inner loop, but at -O1 we keep a no-op loop:

        dls     lr, r3
.L3:
        subs    r3, r3, #1
        le      lr, .L3

and at -O0 we of course don't optimise.

This can lead to long execution times on simulators, possibly
triggering a timeout.

gcc/testsuite
	* gcc.dg/torture/pr112305.c: Skip at -O0 and -O1 for simulators.
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
vec.h has this method:

  template<typename T, typename A>
  inline T *
  vec_safe_push (vec<T, A, vl_embed> *&v, const T &obj CXX_MEM_STAT_INFO)

where v is a reference to a pointer to vec.  This matches the regex for
VecPrinter, so gdbhooks.py attempts to print it but chokes on the reference.
I see the following:

  #1  0x0000000002b84b7b in vec_safe_push<edge_def*, va_gc> (v=Traceback (most
  recent call last):
    File "$SRC/gcc/gcc/gdbhooks.py", line 486, in to_string
      return '0x%x' % intptr(self.gdbval)
    File "$SRC/gcc/gcc/gdbhooks.py", line 168, in intptr
      return long(gdbval) if sys.version_info.major == 2 else int(gdbval)
  gdb.error: Cannot convert value to long.

This patch makes VecPrinter handle such references by stripping them
(dereferencing) at the top of the relevant functions.

gcc/ChangeLog:

	* gdbhooks.py (strip_ref): New. Use it ...
	(VecPrinter.to_string): ... here,
	(VecPrinter.children): ... and here.
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
Brief:
The bug appears in LRA after the rematerialization pass, while creating live ranges.
File lra.cc:
*************************************************************
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* We need full live info -- see the comment above.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
*************************************************************
The call `lra_create_live_ranges (lra_reg_spill_p, true)' is wrong.
It has to be `lra_create_live_ranges (true, true)'.

The explanation:
**********************************
int main (void)
{
  if (a.u33 * a.u33 != 0)
------^^^^^^^^^^^^^
    goto abrt;
  if (a.u33 * a.u40 * a.u33 != 0)
**********************************
The bug appears here.

Part of the expression `a.u33 * a.u33'
Before LRA:
*************************************************************
(insn 13 11 15 2 (set (reg:QI 184 [ _1+3 ])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 3 [0x3]))) [1 a+3 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 15 13 16 2 (set (reg:QI 64 [ a+4 ])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 4 [0x4]))) [1 a+4 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 16 15 20 2 (set (reg:QI 185 [ _1+4 ])
        (zero_extract:QI (reg:QI 64 [ a+4 ])
            (const_int 1 [0x1])
            (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
     (nil))
*************************************************************

After LRA:
*************************************************************
(insn 587 11 13 2 (set (reg:QI 24 r24 [368])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 3 [0x3]))) [1 a+3 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 13 587 15 2 (set (mem/c:QI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [4 %sfp+1 S1 A8])
        (reg:QI 24 r24 [368])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 15 13 16 2 (set (reg:QI 6 r6 [orig:64 a+4 ] [64])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 4 [0x4]))) [1 a+4 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 16 15 572 2 (set (reg:QI 24 r24 [orig:185 _1+4 ] [185])
        (zero_extract:QI (reg:QI 6 r6 [orig:64 a+4 ] [64])
            (const_int 1 [0x1])
            (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
     (nil))
(insn 572 16 20 2 (set (mem/c:QI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [4 %sfp+1 S1 A8])
        (reg:QI 24 r24 [orig:185 _1+4 ] [185])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
*************************************************************
Insn 13 and insn 572 use sfp+1 as a spill slot, but in the IRA pass these were
two different pseudos, r184 and r185.
Insn 13 uses sfp+1 as a spill slot for r184;
insn 572 uses the same slot for r185.  That's wrong.

Here we have a rematerialization.

Fragment from bf.c.317r.reload:
**************************************************************************************
******** Rematerialization #1: ********

df_worklist_dataflow_doublequeue: n_basic_blocks 14 n_edges 18 count 14 (    1)
df_worklist_dataflow_doublequeue: n_basic_blocks 14 n_edges 18 count 14 (    1)

Cands:
0 (nop=0, remat_regno=185, reload_regno=359):
(insn 16 15 572 2 (set (reg:QI 359 [orig:185 _1+4 ] [185])
                    (zero_extract:QI (reg:QI 64 [ a+4 ])
                        (const_int 1 [0x1])
                        (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
                 (nil))

**************************************************************************************
[...]
**************************************************************************************
Ranges after the compression:
 r185: [0..1]
	   Frame pointer can not be eliminated anymore
	   Spilling non-eliminable hard regs: 28 29
	 Spilling r113(28)
	 Spilling r184(29)
	 Spilling r208(29)
	 Spilling r209(28)
  Slot 0 regnos (width = 0):	 185	 209	 208	 184	 113
**************************************************************************************

The bug is here: `r185: [0..1]' is a wrong live range after compression.
r185 and r184 must not share the same spill slot!

Rematerialization in bf.c.317r.reload looks like:
*************************************************************
   24: r14:QI=r185:QI
    Inserting rematerialization insn before:
  581: r14:QI=zero_extract(r64:QI,0x1,0)

deleting insn with uid = 24.
         Considering alt=0 of insn 16:   (0) =r  (1) rYil  (2) n
          overall=0,losers=0,rld_nregs=0
   32: r22:QI=r185:QI
    Inserting rematerialization insn before:
  582: r22:QI=zero_extract(r64:QI,0x1,0)

deleting insn with uid = 32.
*************************************************************

This happens because of the following code.

Fragment from lra.cc (lra):
*************************************************************************
      if (! live_p)
	{
	  /* We need full live info for spilling pseudos into
	     registers instead of memory.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
	  live_p = true;
	}
      /* We should check necessity for spilling here as the above live
	 range pass can remove spilled pseudos.  */
      if (! lra_need_for_spills_p ())
	break;
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* We need full live info -- see the comment above.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
----------------------------------^^^^^^^^^^^^^^^
	  live_p = true;
*************************************************************************

The bug is here.
Rematerialization can sometimes act like spilling pseudos into registers:
  582: r22:QI=zero_extract(r64:QI,0x1,0)

So here we need live ranges for all pseudos.

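The fix, therefore, is to request full live info in that path.  A sketch of
the corrected fragment (the comment wording is mine, not the committed hunk):

*************************************************************************
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* Rematerialization can act like spilling pseudos into
	     registers, so we need full live info here too.  */
	  lra_create_live_ranges (true, true);
	  live_p = true;
*************************************************************************
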
PS: the patch will not affect any target with a usable definition of
    the TARGET_SPILL_CLASS hook.

	PR target/116778
gcc/
	* lra-lives.cc (complete_info_p): Clarify the comment.
	* lra.cc (lra): Create a full live info after rematerialization.
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
This PR reports a missed optimization.  When we have:

  Str str{"Test"};
  callback(str);

as in the test, we're able to evaluate the Str::Str() call at compile
time.  But when we have:

  callback(Str{"Test"});

we are not.  With this patch (in fact, it's Patrick's patch with a little
tweak), we turn

  callback (TARGET_EXPR <D.2890, <<< Unknown tree: aggr_init_expr
    5
    __ct_comp
    D.2890
    (struct Str *) <<< Unknown tree: void_cst >>>
    (const char *) "Test" >>>>)

into

  callback (TARGET_EXPR <D.2890, {.str=(const char *) "Test", .length=4}>)

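For concreteness, a hedged sketch of the shape of the test case (the member
names .str and .length come from the folded constructor above; the constexpr
constructor and the callback signature are assumptions, and the loop in the
constructor needs C++14):

  struct Str {
    const char *str = nullptr;
    int length = 0;
    constexpr Str (const char *s) : str (s)
    {
      while (s[length])	// yields length == 4 for "Test"
	++length;
    }
  };

  void callback (const Str &);

  void g ()
  {
    callback (Str{"Test"});	// prvalue argument, now foldable at compile time
  }
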
I explored the idea of calling maybe_constant_value for the whole
TARGET_EXPR in cp_fold.  That has three problems:
- we can't always elide a TARGET_EXPR, so we'd have to make sure the
  result is also a TARGET_EXPR;
- the resulting TARGET_EXPR must have the same flags, otherwise Bad
  Things happen;
- getting a new slot is also problematic.  I've seen a test where we
  had "TARGET_EXPR<D.2680, ...>, D.2680", and folding the whole TARGET_EXPR
  would get us "TARGET_EXPR<D.2681, ...>", but since we don't see the outer
  D.2680, we can't replace it with D.2681, and things break.

With this patch, two tree-ssa tests regressed: pr78687.C and pr90883.C.

FAIL: g++.dg/tree-ssa/pr90883.C   scan-tree-dump dse1 "Deleted redundant store: .*.a = {}"
is easy.  Previously, we would call C::C, so .gimple has:

  D.2590 = {};
  C::C (&D.2590);
  D.2597 = D.2590;
  return D.2597;

Then .einline inlines the C::C call:

  D.2590 = {};
  D.2590.a = {}; // #1
  D.2590.b = 0;  // #2
  D.2597 = D.2590;
  D.2590 ={v} {CLOBBER(eos)};
  return D.2597;

then #2 is removed in .fre1, and #1 is removed in .dse1.  So the test
passes.  But with the patch, .gimple won't have that C::C call, so the
IL is of course going to look different.  The .optimized dump looks the
same though so there's no problem.

pr78687.C is XFAILed because the test passes with r15-5746 but not with
r15-5747, so the failure is not specific to this patch.  I opened
<https://gcc.gnu.org/PR117971>.

	PR c++/116416

gcc/cp/ChangeLog:

	* cp-gimplify.cc (cp_fold_r) <case TARGET_EXPR>: Try to fold
	TARGET_EXPR_INITIAL and replace it with the folded result if
	it's TREE_CONSTANT.

gcc/testsuite/ChangeLog:

	* g++.dg/analyzer/pr97116.C: Adjust dg-message.
	* g++.dg/tree-ssa/pr78687.C: Add XFAIL.
	* g++.dg/tree-ssa/pr90883.C: Adjust dg-final.
	* g++.dg/cpp0x/constexpr-prvalue1.C: New test.
	* g++.dg/cpp1y/constexpr-prvalue1.C: New test.

Co-authored-by: Patrick Palka <[email protected]>
Reviewed-by: Jason Merrill <[email protected]>
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
With the changes in r15-1579-g792f97b44ff, the code used as "padding" in
the test case is optimized away.  Prevent this optimization by writing to
volatile memory in the padding macro.
Also, validate that there is a far jump in the generated assembler.

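A minimal sketch of the idea (g_0_1 is taken from the assembly below; the
macro body itself is an assumption, not the actual test source): each
expansion of the padding macro stores to a volatile object, so the compiler
cannot delete it and the function stays long enough to require a far jump.

volatile int g_0_1;
#define PAD() do { g_0_1 = 1; } while (0)  /* every use emits a volatile store */
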
Without this patch, the generated assembler is reduced to:
f3:
        cmp     r0, #0
        beq     .L1
        ldr     r4, .L6
.L1:
        bx      lr
.L7:
        .align  2
.L6:
        .word   g_0_1

With the patch, the generated assembler is:
f3:
        movs    r2, #1
        ldr     r3, .L6
        push    {lr}
        str     r2, [r3]
        cmp     r0, #0
        bne     .LCB10
        bl      .L1     @far jump
.LCB10:
        b       .L7
.L8:
        .align  2
.L6:
        .word   .LANCHOR0
.L7:
        str     r2, [r3]
        ...
        str     r2, [r3]
.L1:
        pop     {pc}

gcc/testsuite/ChangeLog:

	* gcc.target/arm/thumb1-far-jump-2.c: Write to volatile memory
	in the macro to avoid optimization.

Signed-off-by: Torbjörn SVENSSON <[email protected]>
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
On Cortex-M4, the code generated is:
     cmp     r0, r1
     itte    ne
     lslne   r0, r0, r1
     asrne   r0, r0, #1
     moveq   r0, r1
     add     r0, r0, r1
     bx      lr

On Cortex-M7, the code generated is:
     cmp     r0, r1
     beq     .L3
     lsls    r0, r0, r1
     asrs    r0, r0, #1
     add     r0, r0, r1
     bx      lr
.L3:
     mov     r0, r1
     add     r0, r0, r1
     bx      lr

As Cortex-M7 only allows at most one conditional instruction, force
Cortex-M4 tuning to keep the test case stable.

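A hedged illustration of the kind of header change implied by the ChangeLog
entry below; only -mtune=cortex-m4 is taken from the patch, the remaining
options are assumptions:

/* { dg-options "-O2 -mthumb -mtune=cortex-m4" } */
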
gcc/testsuite/ChangeLog:

	* gcc.target/arm/thumb-ifcvt.c: Use -mtune=cortex-m4.

Signed-off-by: Torbjörn SVENSSON <[email protected]>
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
This crash started with my r12-7803 but I believe the problem lies
elsewhere.

build_vec_init has cleanup_flags whose purpose is -- if I grok this
correctly -- to avoid destructing an object multiple times.  Let's
say we are initializing an array of A.  Then we might end up in
a scenario similar to initlist-eh1.C:

  try
    {
      call A::A in a loop
      // #0
      try
        {
	  call a fn using the array
	}
      finally
	{
	  // #1
	  call A::~A in a loop
	}
    }
  catch
    {
      // #2
      call A::~A in a loop
    }

cleanup_flags makes us emit a statement like

  D.3048 = 2;

at #0 to disable performing the cleanup at #2, since #1 will take
care of the destruction of the array.

But if we are not emitting the loop because we can use a constant
initializer (and use a single { a, b, ...}), we shouldn't generate
the statement resetting the iterator to its initial value.  Otherwise
we crash in gimplify_var_or_parm_decl because it gets the stray decl
D.3048.

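For concreteness, a hedged sketch of the situation described above (not the
actual new tests): an array whose element type has a destructor (so element
cleanups are in play) but whose elements all have constant initializers (so
no initialization loop is needed).

  struct A {
    int i;
    ~A () { }			// non-trivial destructor, cleanups needed
  };

  void use (A (&)[2]);

  void f ()
  {
    A arr[2] = { {1}, {2} };	// constant element initializers, no init loop
    use (arr);
  }
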
	PR c++/117985

gcc/cp/ChangeLog:

	* init.cc (build_vec_init): Pop CLEANUP_FLAGS if we're not
	generating the loop.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/initlist-array23.C: New test.
	* g++.dg/cpp0x/initlist-array24.C: New test.
cooljeanius pushed a commit that referenced this pull request Jan 14, 2025
This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
default. To that end, the function aarch64_use_new_vector_costs_p and its uses
were removed. To prevent costing vec_to_scalar operations with 0, as
described in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
we adjusted vectorizable_store such that the variable n_adjacent_stores
also covers vec_to_scalar operations. This way vec_to_scalar operations
are not costed individually, but as a group.
As suggested by Richard Sandiford, the "known_ne" in the multilane-check
was replaced by "maybe_ne" in order to treat nunits==1+1X as a vector
rather than a scalar.

Two tests were adjusted due to changes in codegen.  In both cases, the
old code unrolled the loop once, but the new code does not:
Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L92
        mov     x4, 0
        uxtw    x3, w3
+       cntd    x5
+       whilelo p7.d, xzr, x3
+       mov     z29.s, w5
        mov     z31.s, w2
-       whilelo p6.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p7.d, xzr, x2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L94:
-       ld1d    z27.d, p7/z, [x0, #1, mul vl]
-       ld1d    z28.d, p6/z, [x0]
-       movprfx z29, z31
-       mul     z29.s, p5/m, z29.s, z30.s
-       incw    x4
-       uunpklo z0.d, z29.s
-       uunpkhi z29.d, z29.s
-       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
-       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
-       add     z25.d, z28.d, z25.d
+       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
+       movprfx z28, z31
+       mul     z28.s, p6/m, z28.s, z30.s
+       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
        add     z26.d, z27.d, z26.d
-       st1d    z26.d, p7, [x0, #1, mul vl]
-       whilelo p7.d, x4, x2
-       st1d    z25.d, p6, [x0]
-       incw    z30.s
-       incb    x0, all, mul #2
-       whilelo p6.d, x4, x3
+       st1d    z26.d, p7, [x0, x4, lsl 3]
+       add     z30.s, z30.s, z29.s
+       incd    x4
+       whilelo p7.d, x4, x3
        b.any   .L94
 .L92:
        ret

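For orientation, the f_int64_t_32 loop plausibly has roughly this shape (a
reconstruction from the assembly, not the actual test source); the
strided_store_2.c example below is the mirrored form that scatter-stores to
dst[i * stride]:

#include <stdint.h>

void
f_int64_t_32 (int64_t *dst, int64_t *src, int32_t stride, int32_t n)
{
  for (int32_t i = 0; i < n; ++i)
    dst[i] += src[i * stride];
}
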
Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L84
-       addvl   x5, x1, #1
        mov     x4, 0
        uxtw    x3, w3
-       mov     z31.s, w2
+       cntd    x5
        whilelo p7.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p6.d, xzr, x2
+       mov     z29.s, w5
+       mov     z31.s, w2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L86:
-       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
-       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
-       movprfx z29, z30
-       mul     z29.s, p5/m, z29.s, z31.s
-       add     z28.d, z28.d, #1
-       uunpklo z26.d, z29.s
-       st1d    z28.d, p7, [x0, z26.d, lsl 3]
-       incw    x4
-       uunpkhi z29.d, z29.s
+       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
+       movprfx z28, z30
+       mul     z28.s, p6/m, z28.s, z31.s
        add     z27.d, z27.d, #1
-       whilelo p6.d, x4, x2
-       st1d    z27.d, p7, [x0, z29.d, lsl 3]
-       incw    z30.s
+       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
+       incd    x4
+       add     z30.s, z30.s, z29.s
        whilelo p7.d, x4, x3
        b.any   .L86
 .L84:
	ret

The patch was bootstrapped and tested on aarch64-linux-gnu, no
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <[email protected]>

gcc/
	* tree-vect-stmts.cc (vectorizable_store): Extend the use of
	n_adjacent_stores to also cover vec_to_scalar operations.
	* config/aarch64/aarch64-tuning-flags.def: Remove
	use_new_vector_costs as tuning option.
	* config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
	Remove.
	(aarch64_vector_costs::add_stmt_cost): Remove use of
	aarch64_use_new_vector_costs_p.
	(aarch64_vector_costs::finish_cost): Remove use of
	aarch64_use_new_vector_costs_p.
	* config/aarch64/tuning_models/cortexx925.h: Remove
	AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
	* config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
	* config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
	* config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
	* config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
	* config/aarch64/tuning_models/neoversen2.h: Likewise.
	* config/aarch64/tuning_models/neoversen3.h: Likewise.
	* config/aarch64/tuning_models/neoversev1.h: Likewise.
	* config/aarch64/tuning_models/neoversev2.h: Likewise.
	* config/aarch64/tuning_models/neoversev3.h: Likewise.
	* config/aarch64/tuning_models/neoversev3ae.h: Likewise.

gcc/testsuite/
	* gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
	* gcc.target/aarch64/sve/strided_store_2.c: Likewise.