Add GitHub CI #1

Merged
merged 1 commit into from
Sep 10, 2023

Conversation

cooljeanius
Owner

From @talregev

cooljeanius merged commit e6f92c6 into cooljeanius:me/CI on Sep 10, 2023
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
This patch removes one machine instruction from the "single bit extraction
with shifting" operation, and tries to eliminate the conditional
branch if CST2_POW2 doesn't fit into signed 12 bits, with the help
of the ifcvt optimization.

    /* example #1 */
    int test0(int x) {
      return (x & 1048576) != 0 ? 1024 : 0;
    }
    extern int foo(void);
    int test1(void) {
      return (foo() & 1048576) != 0 ? 16777216 : 0;
    }

    ;; before
    test0:
	movi	a9, 0x400
	srai	a2, a2, 10
	and	a2, a2, a9
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	extui	a2, a2, 20, 1
	slli	a2, a2, 20
	beqz.n	a2, .L2
	movi.n	a2, 1
	slli	a2, a2, 24
    .L2:
	l32i.n	a0, sp, 12
	addi	sp, sp, 16
	ret.n

    ;; after
    test0:
	extui	a2, a2, 20, 1
	slli	a2, a2, 10
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	l32i.n	a0, sp, 12
	extui	a2, a2, 20, 1
	slli	a2, a2, 24
	addi	sp, sp, 16
	ret.n

In addition, if the left shift amount ('exact_log2(CST2_POW2)') is
between 1 and 3 and either an addition or a subtraction with another
register follows, emit an ADDX[248] or SUBX[248] machine instruction
instead of separate left shift and add/subtract ones.

    /* example #2 */
    int test2(int x, int y) {
      return ((x & 1048576) != 0 ? 4 : 0) + y;
    }
    int test3(int x, int y) {
      return ((x & 2) != 0 ? 8 : 0) - y;
    }

    ;; before
    test2:
	movi.n	a9, 4
	srai	a2, a2, 18
	and	a2, a2, a9
	add.n	a2, a2, a3
	ret.n
    test3:
	movi.n	a9, 8
	slli	a2, a2, 2
	and	a2, a2, a9
	sub	a2, a2, a3
	ret.n

    ;; after
    test2:
	extui	a2, a2, 20, 1
	addx4	a2, a2, a3
	ret.n
    test3:
	extui	a2, a2, 1, 1
	subx8	a2, a2, a3
	ret.n

gcc/ChangeLog:

	* config/xtensa/predicates.md (addsub_operator): New.
	* config/xtensa/xtensa.md (*extzvsi-1bit_ashlsi3,
	*extzvsi-1bit_addsubx): New insn_and_split patterns.
	* config/xtensa/xtensa.cc (xtensa_rtx_costs):
	Add a special case about ifcvt 'noce_try_cmove()' to handle
	constant loads that do not fit into signed 12 bits in the
	patterns added above.
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
In plenty of image and video processing code it's common to modify pixel values
by a widening operation and then scale them back into range by dividing by 255.

This patch adds a named function to allow us to emit an optimized sequence
when doing an unsigned division that is equivalent to:

   x = y / (2 ^ (bitsize (y)/2)-1)

For SVE2 this means we generate for:

void draw_bitmap1(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

the following:

        mov     z3.b, #1
.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

instead of:

.L3:
        ld1b    z0.h, p1/z, [x0, x3]
        mul     z0.h, p0/m, z0.h, z1.h
        umulh   z0.h, p0/m, z0.h, z2.h
        lsr     z0.h, z0.h, #7
        st1b    z0.h, p1, [x0, x3]
        inch    x3
        whilelo p1.h, w3, w2
        b.any   .L3

This results in significantly faster code.

gcc/ChangeLog:

	* config/aarch64/aarch64-sve2.md (@aarch64_bitmask_udiv<mode>3): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve2/div-by-bitmask_1.c: New test.
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
A friend declaration can only have constraints if it is defined.  If
multiple instantiations of a class template define the same friend function
signature, it's an error, but that shouldn't happen if it's constrained to
only be declared in one instantiation.

Currently we don't mangle requirements, so the foos all mangle the same and
actually instantiating #1 will break, but for now we can test that they're
considered distinct.
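
A minimal, hypothetical sketch of the scenario (not the actual
concepts-friend11.C test): each instantiation defines a friend foo with the
same signature, but the constraint depends on T, so the foos coming from
A<char> and A<int> are considered distinct functions.

  template<class T>
  struct A {
    friend void foo() requires (sizeof(T) == 1) { }  // constrained friend definition
  };

  A<char> a;
  A<int>  b;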

gcc/cp/ChangeLog:

	* pt.cc (tsubst_friend_function): Check satisfaction.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/concepts-friend11.C: New test.
cooljeanius pushed a commit that referenced this pull request Sep 29, 2023
To improve compile times, the C++ library could use compiler built-ins
rather than implementing std::is_convertible (and _nothrow) as class
templates.  This patch adds the built-ins.  We already have
__is_constructible and __is_assignable, and the nothrow forms of those.

Microsoft (and clang, for compatibility) also provide an alias called
__is_convertible_to.  I did not add it, but it would be trivial to do
so.
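
For reference, a short sketch of how the new built-ins are spelled (same
argument order as std::is_convertible<From, To>):

  static_assert(__is_convertible(int, double), "");
  static_assert(!__is_convertible(int*, double*), "");
  static_assert(__is_nothrow_convertible(int, long), "");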

I noticed that our __is_assignable doesn't implement the "Access checks
are performed as if from a context unrelated to either type" requirement;
therefore std::is_assignable and __is_assignable give two different results
here:

  class S {
    operator int();
    friend void g(); // #1
  };

  void
  g ()
  {
    // #1 doesn't matter
    static_assert(std::is_assignable<int&, S>::value, "");
    static_assert(__is_assignable(int&, S), "");
  }

This is not a problem if __is_assignable is not meant to be used by
the users.

This patch doesn't make libstdc++ use the new built-ins, but I had to
rename a class otherwise its name would clash with the new built-in.

	PR c++/106784

gcc/c-family/ChangeLog:

	* c-common.cc (c_common_reswords): Add __is_convertible and
	__is_nothrow_convertible.
	* c-common.h (enum rid): Add RID_IS_CONVERTIBLE and
	RID_IS_NOTHROW_CONVERTIBLE.

gcc/cp/ChangeLog:

	* constraint.cc (diagnose_trait_expr): Handle CPTK_IS_CONVERTIBLE
	and CPTK_IS_NOTHROW_CONVERTIBLE.
	* cp-objcp-common.cc (names_builtin_p): Handle RID_IS_CONVERTIBLE
	and RID_IS_NOTHROW_CONVERTIBLE.
	* cp-tree.h (enum cp_trait_kind): Add CPTK_IS_CONVERTIBLE and
	CPTK_IS_NOTHROW_CONVERTIBLE.
	(is_convertible): Declare.
	(is_nothrow_convertible): Likewise.
	* cxx-pretty-print.cc (pp_cxx_trait_expression): Handle
	CPTK_IS_CONVERTIBLE and CPTK_IS_NOTHROW_CONVERTIBLE.
	* method.cc (is_convertible): New.
	(is_nothrow_convertible): Likewise.
	* parser.cc (cp_parser_primary_expression): Handle RID_IS_CONVERTIBLE
	and RID_IS_NOTHROW_CONVERTIBLE.
	(cp_parser_trait_expr): Likewise.
	* semantics.cc (trait_expr_value): Handle CPTK_IS_CONVERTIBLE and
	CPTK_IS_NOTHROW_CONVERTIBLE.
	(finish_trait_expr): Likewise.

libstdc++-v3/ChangeLog:

	* include/std/type_traits: Rename __is_nothrow_convertible to
	__is_nothrow_convertible_lib.
	* testsuite/20_util/is_nothrow_convertible/value_ext.cc: Likewise.

gcc/testsuite/ChangeLog:

	* g++.dg/ext/has-builtin-1.C: Enhance to test __is_convertible and
	__is_nothrow_convertible.
	* g++.dg/ext/is_convertible1.C: New test.
	* g++.dg/ext/is_convertible2.C: New test.
	* g++.dg/ext/is_nothrow_convertible1.C: New test.
	* g++.dg/ext/is_nothrow_convertible2.C: New test.
cooljeanius pushed a commit that referenced this pull request Oct 28, 2023
This patch is my proposed solution to PR rtl-optimization/91865.
Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND
to a single ZERO_EXTEND, but as shown in this PR it is possible for
combine's make_compound_operation to unintentionally generate a
non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be
matched by the backend.

For the new test case:

const int table[2] = {1, 2};
int foo (char i) { return table[i]; }

compiling with -O2 -mlarge on msp430 we currently see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Failed to match this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ]))))

which results in the following code:

foo:	AND     #0xff, R12
        RLAM.A #4, R12 { RRAM.A #4, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

With this patch, we now see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Successfully matched this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ])))
allowing combination of insns 2 and 7
original costs 4 + 8 = 12
replacement cost 8

foo:	MOV.B   R12, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

2023-10-26  Roger Sayle  <[email protected]>
	    Richard Biener  <[email protected]>

gcc/ChangeLog
	PR rtl-optimization/91865
	* combine.cc (make_compound_operation): Avoid creating a
	ZERO_EXTEND of a ZERO_EXTEND.

gcc/testsuite/ChangeLog
	PR rtl-optimization/91865
	* gcc.target/msp430/pr91865.c: New test case.
cooljeanius pushed a commit that referenced this pull request Dec 18, 2023
Since the last import from upstream libsanitizer, the output has changed
and now looks more like this:

READ of size 6 at 0x7ff7beb2a144 thread T0
    #0 0x101cf7796 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) sanitizer_common_interceptors.inc:813
    #1 0x101cf7b99 in memcmp sanitizer_common_interceptors.inc:840
    #2 0x108a0c39f in __stack_chk_guard+0xf (dyld:x86_64+0x8039f)

so let's adjust the pattern accordingly.

gcc/testsuite/ChangeLog:

	* c-c++-common/asan/memcmp-1.c: Adjust pattern on darwin.
cooljeanius pushed a commit that referenced this pull request Dec 23, 2023
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // #1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than #1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // #1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider #1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
cooljeanius pushed a commit that referenced this pull request Jan 16, 2024
Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab;
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
when the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed, which enables vectorization of many
other loop forms.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed-size iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes. N must also be known.  This limitation is because our primary target
  for this optimization is SVE.  For VLA SVE we can't easily do cross page
  iteration checks.  The result is also likely not to be beneficial.  For that
  reason we punt support for variable buffers till we have First-Faulting
  support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated from the store, or the store itself can't be moved.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Peeling code supports it but the code motion code cannot
  find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.

With the help of IPA and LTO this still gets hit quite often.  During bootstrap
it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depends on peeling and iteration start and which
exit was taken (natural or early).   For this scalar loop, all early exits are
treated the same.

When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break.  If this is not possible we decline to vectorize.  The
analysis and code motion also takes into account that it doesn't introduce a RAW
dependency after the move of the stores.

This means that we check at the start of iterations whether we are going to exit
or not.  During the analysis phase we check whether we are allowed to do this
moving of statements.  Also note that we only move the scalar statements, and
only do so after peeling, just before we start transforming statements.

With this the vector flow no longer necessarily needs to match that of the
scalar code.  In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.

Codegen:

for e.g.

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15 as for majority of the cases build
   SLP itself fails.  This means I'll need to spend time in making this more
   robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over
various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (idx_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count and instead look at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
cooljeanius pushed a commit that referenced this pull request Feb 5, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2, and smaller but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <[email protected]>
cooljeanius pushed a commit that referenced this pull request Apr 7, 2024
We evaluate constexpr functions on the original, pre-genericization bodies.
That means that the function body we're evaluating will not have gone
through cp_genericize_r's "Map block scope extern declarations to visible
declarations with the same name and type in outer scopes if any".  Here:

  constexpr bool bar() { return true; } // #1
  constexpr bool foo() {
    constexpr bool bar(void); // #2
    return bar();
  }

it means that we:
1) register_constexpr_fundef (#1)
2) cp_genericize (#1)
   nothing interesting happens
3) register_constexpr_fundef (foo)
   does copy_fn, so we have two copies of the BIND_EXPR
4) cp_genericize (foo)
   this remaps #2 to #1, but only on one copy of the BIND_EXPR
5) retrieve_constexpr_fundef (foo)
   we find it, no problem
6) retrieve_constexpr_fundef (#2)
   and here #2 isn't found in constexpr_fundef_table, because
   we're working on the BIND_EXPR copy where #2 wasn't mapped to #1
   so we fail.  We've only registered #1.

It should work to use DECL_LOCAL_DECL_ALIAS (which used to be
extern_decl_map).  We evaluate constexpr functions on pre-cp_fold
bodies to avoid diagnostic problems, but the remapping I'm proposing
should not interfere with diagnostics.

This is not a problem for a global scope redeclaration; there we go
through duplicate_decls which keeps the DECL_UID:
  DECL_UID (olddecl) = olddecl_uid;
and DECL_UID is what constexpr_fundef_hasher::hash uses.

	PR c++/111132

gcc/cp/ChangeLog:

	* constexpr.cc (get_function_named_in_call): Use
	cp_get_fndecl_from_callee.
	* cvt.cc (cp_get_fndecl_from_callee): If there's a
	DECL_LOCAL_DECL_ALIAS, use it.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/constexpr-redeclaration3.C: New test.
	* g++.dg/cpp0x/constexpr-redeclaration4.C: New test.
cooljeanius pushed a commit that referenced this pull request Apr 7, 2024
aarch64-sve.md had a pattern that combined:

	cmpeq	pb.T, pa/z, zc.T, #0
	mov	zd.T, pb/z, #1

into:

	cnot	zd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.
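
For reference, a hedged ACLE-level sketch (assuming <arm_sve.h>) of an
affected _x operation: after this fix, with a general (possibly non-ptrue)
predicate it is represented as a fully-defined _m operation instead of being
folded to a single CNOT.

  #include <arm_sve.h>

  svint32_t
  f (svbool_t pg, svint32_t x)
  {
    return svcnot_s32_x (pg, x);
  }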

gcc/
	PR target/114603
	* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot<mode>): Replace
	with...
	(@aarch64_ptrue_cnot<mode>): ...this, requiring operand 1 to be
	a ptrue.
	(*cnot<mode>): Require operand 1 to be a ptrue.
	* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
	Use aarch64_ptrue_cnot<mode> for _x operations that are predicated
	with a ptrue.  Represent other _x operations as fully-defined _m
	operations.

gcc/testsuite/
	PR target/114603
	* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.
cooljeanius pushed a commit that referenced this pull request Apr 22, 2024
…. [PR114741]

In PR114741 we see that we have a regression in codegen when SVE is enabled, where
the simple testcase:

void foo(unsigned v, unsigned *p)
{
    *p = v & 1;
}

generates

foo:
        fmov    s31, w0
        and     z31.s, z31.s, #1
        str     s31, [x1]
        ret

instead of:

foo:
        and     w0, w0, 1
        str     w0, [x1]
        ret

This impacts not just codesize but also performance.  The regression is caused
by the use of the ^ constraint modifier in the pattern <optab><mode>3.

The documentation states that this modifier should only have an effect on the
alternative costing in that a particular alternative is to be preferred unless
a non-pseudo reload is needed.

The pattern was trying to convey that whenever both r and w are required,
it should prefer r unless a reload is needed.  This is because if a reload is
needed then we can construct the constants more flexibly on the SIMD side.

We were using this to simplify the implementation and to get generic cases such
as:

double negabs (double x)
{
   unsigned long long y;
   memcpy (&y, &x, sizeof(double));
   y = y | (1UL << 63);
   memcpy (&x, &y, sizeof(double));
   return x;
}

which don't go through an expander.
However, the implementation of ^ in the register allocator does not match
the documentation, in that it also has an effect during coloring.  During initial
register class selection it applies a penalty to a class, similar to how ? does.

In this example the penalty makes the use of GP regs expensive enough that it no
longer considers them:

    r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
;;        3--> b  0: i   9 r106=r105&0x1
    :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
                         PR_HI_REGS+0(0):model 4

which is not the expected behavior.  For GCC 14 this is a conservative fix.

1. we remove the ^ modifier from the logical optabs.

2. In order not to regress copysign we then move the copysign expansion to
   directly use the SIMD variant.  Since copysign only supports floating point
   modes this is fine and no longer relies on the register allocator to select
   the right alternative.

It once again regresses the general case, but this case wasn't optimized in
earlier GCCs either so it's not a regression in GCC 14.  This change gives
strictly better codegen than earlier GCCs and still optimizes the important cases.

gcc/ChangeLog:

	PR target/114741
	* config/aarch64/aarch64.md (<optab><mode>3): Remove ^ from alt 2.
	(copysign<GPF:mode>3): Use SIMD version of IOR directly.

gcc/testsuite/ChangeLog:

	PR target/114741
	* gcc.target/aarch64/fneg-abs_2.c: Update codegen.
	* gcc.target/aarch64/fneg-abs_4.c: xfail for now.
	* gcc.target/aarch64/pr114741.c: New test.
cooljeanius pushed a commit that referenced this pull request Jun 10, 2024
The PR complains that

  void do_something(){
    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wunused-label"
    start:;
    #pragma GCC diagnostic pop
  } #1

doesn't work.  That's because we call warn_for_unused_label only while we're
in finish_function, meaning we're at #1, where we're outside the #pragma
region.  We can use suppress_warning + warning_suppressed_p to fix this.

Note that I'm not using TREE_USED.  Propagating it in tsubst_stmt/LABEL_EXPR
from decl to label would mean that we don't warn in do_something2, but
I think we want the warning there: we're in a template and the goto is
a discarded statement.

	PR c++/113582

gcc/c-family/ChangeLog:

	* c-warn.cc (warn_for_unused_label): Don't warn if -Wunused-label has
	been suppressed for the label.

gcc/cp/ChangeLog:

	* parser.cc (cp_parser_label_for_labeled_statement): suppress_warning
	if it's not enabled at input_location.
	* pt.cc (tsubst_stmt): Call copy_warning.

gcc/testsuite/ChangeLog:

	* g++.dg/warn/Wunused-label-4.C: New test.
cooljeanius pushed a commit that referenced this pull request Jun 10, 2024
This patch fixes the trailing-operator format issues below.

=== ERROR type #1: trailing operator (4 error(s)) ===
gcc/config/riscv/riscv-vector-builtins.cc:4641:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_16) &&
gcc/config/riscv/riscv-vector-builtins.cc:4651:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_32) &&
gcc/config/riscv/riscv-vector-builtins.cc:4661:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_64) &&
gcc/config/riscv/riscv-vector-builtins.cc:4670:36:  if ((exts &
RVV_REQUIRE_ELEN_64) &&

The patch passes ./contrib/check_GNU_style.sh, and I double-checked
that there is no other format issue in the original patch.

Committed as format change.
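
A hedged, self-contained illustration of the style rule involved (the
condition and names below are stand-ins, not the real
riscv-vector-builtins.cc code):

  #define RVV_REQUIRE_ELEN_FP_16 (1u << 0)  /* stand-in value */

  static int
  ext_ok_p (unsigned exts, int have_fp16)
  {
    /* check_GNU_style.sh flags the trailing operator in
         if ((exts & RVV_REQUIRE_ELEN_FP_16) &&
             !have_fp16)
       so the operator is moved to the start of the continuation line:  */
    if ((exts & RVV_REQUIRE_ELEN_FP_16)
        && !have_fp16)
      return 0;
    return 1;
  }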

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins.cc
	(validate_instance_type_required_extensions): Remove the
	trailing operator and put it on a new line.

Signed-off-by: Pan Li <[email protected]>
cooljeanius pushed a commit that referenced this pull request Jul 7, 2024
Here during overload resolution we have two strictly viable ambiguous
candidates #1 and #2, and two non-strictly viable candidates #3 and #4
which we hold on to ever since r14-6522.  These latter candidates have
an empty second arg conversion since the first arg conversion was deemed
bad, and this trips up joust, which assumes all arg conversions are present,
when called on #3 and #4.

We can fix this by making joust robust to empty arg conversions, but in
this situation we shouldn't need to compare #3 and #4 at all given that
we have a strictly viable candidate.  To that end, this patch makes
tourney shortcut considering non-strictly viable candidates upon
encountering ambiguity between two strictly viable candidates (taking
advantage of the fact that the candidates list is sorted according to
viability via splice_viable).
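
A hypothetical sketch of the shape of the problem (not the actual PR 115239
testcase): #1 and #2 are strictly viable but ambiguous for the call, while
#3 and #4 are only non-strictly viable because binding the prvalue 0 to int&
is a bad conversion; with this patch tourney never gets to comparing #3 and #4.

  void f(int, long);    // #1 strictly viable
  void f(long, int);    // #2 strictly viable, ambiguous with #1 for f(0, 0)
  void f(int&, int);    // #3 non-strictly viable: bad first arg conversion
  void f(int&, long);   // #4 non-strictly viable
  void g() { f(0, 0); } // error: ambiguous, without ever jousting #3 and #4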

	PR c++/115239

gcc/cp/ChangeLog:

	* call.cc (tourney): Don't consider a non-strictly viable
	candidate as the champ if there was ambiguity between two
	strictly viable candidates.

gcc/testsuite/ChangeLog:

	* g++.dg/overload/error7.C: New test.

Reviewed-by: Jason Merrill <[email protected]>
cooljeanius pushed a commit that referenced this pull request Jul 7, 2024
This patch improves GCC’s vectorization of __builtin_popcount for the aarch64 target
by adding popcount patterns for vector modes besides QImode, i.e., HImode,
SImode and DImode.
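
A hedged sketch of the kind of source loop this enables (not one of the new
tests): a popcount over 16-bit elements that can be vectorized with the
cnt/uaddlp sequences shown below.

  void
  popcount_u16 (unsigned short *__restrict dst,
                unsigned short *__restrict src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = __builtin_popcount (src[i]);
  }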

With this patch, we now generate the following for V8HI:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b

For V4HI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b

For V4SI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h

For V4SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b

For V2SI, we generate:
  cnt     v1.8b, v.8b
  uaddlp  v2.4h, v1.8b
  uaddlp  v3.2s, v2.4h

For V2SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.8b, #0
  movi    v1.8b, #1
  cnt     v3.8b, v2.8b
  udot    v0.2s, v3.8b, v1.8b

For V2DI, we generate:
  cnt     v1.16b, v.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h
  uaddlp  v4.2d, v3.4s

For V2DI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b
  uaddlp  v0.2d, v0.4s

	PR target/113859

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_<su>addlp<mode>): Rename to...
	(@aarch64_<su>addlp<mode>): ... This.
	(popcount<mode>2): New define_expand.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/popcnt-udot.c: New test.
	* gcc.target/aarch64/popcnt-vec.c: New test.

Signed-off-by: Pengxuan Zheng <[email protected]>
cooljeanius pushed a commit that referenced this pull request Aug 13, 2024
On many cores, including Neoverse V2 the throughput of vector ADD
instructions is higher than vector shifts like SHL.  We can lean on that
to emit code like:
  add     v0.4s, v0.4s, v0.4s
instead of:
  shl     v0.4s, v0.4s, 1

LLVM already does this trick.
In RTL the code gets canonicalised from (plus x x) to (ashift x 1), so I
opted to instead do this at the final assembly printing stage, similar
to how we emit CMLT instead of SSHR elsewhere in the backend.

I'd like to also do this for SVE shifts, but those will have to be
separate patches.
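
A minimal sketch (GNU vector extensions; not an added testcase) of a shift
that can now be printed in the ADD form:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  shl1 (v4si x)
  {
    return x << 1;  /* canonicalised to (ashift x 1) in RTL */
  }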

Signed-off-by: Kyrylo Tkachov <[email protected]>

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(aarch64_simd_imm_shl<mode><vczle><vczbe>): Rewrite to new
	syntax.  Add =w,w,vs1 alternative.
	* config/aarch64/constraints.md (vs1): New constraint.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/advsimd_shl_add.c: New test.
cooljeanius pushed a commit that referenced this pull request Aug 28, 2024
This patch tweaks timode_scalar_chain::compute_convert_gain to better
reflect the expansion of V1TImode arithmetic right shifts by the i386
backend.  The comment "see ix86_expand_v1ti_ashiftrt" appears after
"case ASHIFTRT" in compute_convert_gain, and the changes below attempt
to better match the logic used there.

The original motivating example is:

__int128 m1;
void foo()
{
  m1 = (m1 << 8) >> 8;
}

which with -O2 -mavx2 we fail to convert to vector form due to the
inappropriate cost of the arithmetic right shift.

  Instruction gain -16 for     7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: -3
  Chain #1 conversion is not profitable

This is reporting that the ASHIFTRT is four instructions worse using
vectors than in scalar form, which is incorrect as the AVX2 expansion
of this shift only requires three instructions (and the scalar form
requires two).

With more accurate costs in timode_scalar_chain::compute_convert_gain
we now see (with -O2 -mavx2):

  Instruction gain -4 for     7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: 9
  Converting chain #1...

which results in:

foo:	vmovdqa m1(%rip), %xmm0
        vpslldq $1, %xmm0, %xmm0
        vpsrad  $8, %xmm0, %xmm1
        vpsrldq $1, %xmm0, %xmm0
        vpblendd        $7, %xmm0, %xmm1, %xmm0
        vmovdqa %xmm0, m1(%rip)
        ret

2024-08-25  Roger Sayle  <[email protected]>
	    Uros Bizjak  <[email protected]>

gcc/ChangeLog
	* config/i386/i386-features.cc (compute_convert_gain)
	<case ASHIFTRT>: Update to match ix86_expand_v1ti_ashiftrt.
cooljeanius pushed a commit that referenced this pull request Sep 20, 2024
…o_debug_section [PR116614]

cat abc.C
  #define A(n) struct T##n {} t##n;
  #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
  #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
  #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
  #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
  E(1) E(2) E(3)
  int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The following patch fixes that.  Most of the 64K+ section support for
reading and writing was already there years ago (and especially reading was
used quite often already), and a further bug in it was fixed by the PR104617 fix.

Yet, the fix isn't solely about removing the
  if (new_i - 1 >= SHN_LORESERVE)
    {
      *err = ENOTSUP;
      return "Too many copied sections";
    }
5 lines; the missing part was that the function only handled reading of
the .symtab_shndx section but not copying/updating of it.
If the result has less than 64K-epsilon sections, that actually wasn't
needed, but e.g. with -fdebug-types-section one can exceed that pretty
easily (reported to us on WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it basically
needs to be done in lock step with updating the .symtab section: if one
doesn't need to use SHN_XINDEX in there, the section should (or should be
updated to) contain an SHN_UNDEF entry, otherwise it needs to have whatever
would otherwise be stored but couldn't fit.  But repeating all the symtab
decisions about what to discard and how to rewrite it just for that would be ugly.

So, the patch instead emits the .symtab_shndx section (or sections) last
and prepares the content during the .symtab processing and in a second
pass when going just through .symtab_shndx sections just uses the saved
content.

2024-09-07  Jakub Jelinek  <[email protected]>

	PR lto/116614
	* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
	comments.
	(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
	consistency.
	(simple_object_elf_find_sections): Formatting fixes.
	(simple_object_elf_fetch_attributes): Likewise.
	(simple_object_elf_attributes_merge): Likewise.
	(simple_object_elf_start_write): Likewise.
	(simple_object_elf_write_ehdr): Likewise.
	(simple_object_elf_write_shdr): Likewise.
	(simple_object_elf_write_to_file): Likewise.
	(simple_object_elf_copy_lto_debug_section): Likewise.  Don't fail for
	new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
	over .symtab_shndx sections, though emit those last and compute their
	section content when processing associated .symtab sections.  Handle
	simple_object_internal_read failure even in the .symtab_shndx reading
	case.
cooljeanius pushed a commit that referenced this pull request Sep 20, 2024
On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
instructions like SHL have a throughput of 2. We can lean on that to emit code
like:
 add	z31.b, z31.b, z31.b
instead of:
 lsl	z31.b, z31.b, #1

The implementation of this change for SVE vectors is similar to a prior patch
<https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds
the above functionality for Neon vectors.

Here, the machine descriptor pattern is split up to separately accommodate left
and right shifts, so we can specifically emit an add for all left shifts by 1.
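
A hedged ACLE-level sketch (assuming <arm_sve.h>; not an added testcase) of a
left shift by 1 that can now be emitted in the ADD form:

  #include <arm_sve.h>

  svuint8_t
  shl1 (svbool_t pg, svuint8_t x)
  {
    return svlsl_n_u8_x (pg, x, 1);
  }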

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR <[email protected]>

gcc/ChangeLog:

	* config/aarch64/aarch64-sve.md (*post_ra_v<optab><mode>3): Split pattern
	to accommodate left and right shifts separately.
	(*post_ra_v_ashl<mode>3): Matches left shifts with additional
	constraint to check for shifts by 1.
	(*post_ra_v_<optab><mode>3): Matches right shifts.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1
	with corresponding add.
	* gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
	* gcc.target/aarch64/sve/adr_1.c: Likewise.
	* gcc.target/aarch64/sve/adr_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
	* gcc.target/aarch64/sve/shift_2.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
	* gcc.target/aarch64/sve/sve_shl_add.c: New test.
cooljeanius pushed a commit that referenced this pull request Sep 20, 2024
…on [PR113328]

SVE's INDEX instruction can be used to populate vectors with values starting from
"base" and incremented by "step" for each subsequent value. We can take
advantage of it to generate vector constants if TARGET_SVE is available and the
base and step values are within [-16, 15].

For example, with the following function:

typedef int v4si __attribute__ ((vector_size (16)));
v4si
f_v4si (void)
{
  return (v4si){ 0, 1, 2, 3 };
}

GCC currently generates:

f_v4si:
	adrp    x0, .LC4
	ldr     q0, [x0, #:lo12:.LC4]
	ret

.LC4:
	.word   0
	.word   1
	.word   2
	.word   3

With this patch, we generate an INDEX instruction instead if TARGET_SVE is
available.

f_v4si:
	index   z0.s, #0, #1
	ret
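
As another hedged example of the stated [-16, 15] rule (reusing the v4si
typedef above; not an added testcase), a strided constant can likewise become
a single INDEX:

v4si
g_v4si (void)
{
  return (v4si){ 1, 3, 5, 7 };  /* expected: index z0.s, #1, #2 */
}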

	PR target/113328

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_simd_valid_immediate): Improve
	handling of some ADVSIMD vectors by using SVE's INDEX if TARGET_SVE is
	available.
	(aarch64_output_simd_mov_immediate): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/general/dupq_1.c: Update test to use
	SVE's INDEX instruction.
	* gcc.target/aarch64/sve/acle/general/dupq_2.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_3.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise.
	* gcc.target/aarch64/sve/vec_init_3.c: New test.

Signed-off-by: Pengxuan Zheng <[email protected]>
cooljeanius pushed a commit that referenced this pull request Oct 21, 2024
Implement vddup and vidup using the new MVE builtins framework.

We generate better code because we take advantage of the two outputs
produced by the v[id]dup instructions.

For instance, before:
	ldr	r3, [r0]
	sub	r2, r3, #8
	str	r2, [r0]
	mov	r2, r3
	vddup.u16	q3, r2, #1

now:
	ldr	r2, [r0]
	vddup.u16	q3, r2, #1
	str	r2, [r0]
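
A hedged sketch of a source-level use that produces the shorter sequence
above (the intrinsic spelling follows ACLE MVE and arm_mve.h; treat the exact
signature as an assumption, it is not part of this patch):

  #include <arm_mve.h>

  uint16x8_t
  f (uint32_t *p)
  {
    /* Writeback form: the vector result and the updated *p are the two
       outputs that the new framework now uses directly.  */
    return vddupq_wb_u16 (p, 1);
  }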

2024-08-21  Christophe Lyon  <[email protected]>

	gcc/
	* config/arm/arm-mve-builtins-base.cc (class viddup_impl): New.
	(vddup): New.
	(vidup): New.
	* config/arm/arm-mve-builtins-base.def (vddupq): New.
	(vidupq): New.
	* config/arm/arm-mve-builtins-base.h (vddupq): New.
	(vidupq): New.
	* config/arm/arm_mve.h (vddupq_m): Delete.
	(vddupq_u8): Delete.
	(vddupq_u32): Delete.
	(vddupq_u16): Delete.
	(vidupq_m): Delete.
	(vidupq_u8): Delete.
	(vidupq_u32): Delete.
	(vidupq_u16): Delete.
	(vddupq_x_u8): Delete.
	(vddupq_x_u16): Delete.
	(vddupq_x_u32): Delete.
	(vidupq_x_u8): Delete.
	(vidupq_x_u16): Delete.
	(vidupq_x_u32): Delete.
	(vddupq_m_n_u8): Delete.
	(vddupq_m_n_u32): Delete.
	(vddupq_m_n_u16): Delete.
	(vddupq_m_wb_u8): Delete.
	(vddupq_m_wb_u16): Delete.
	(vddupq_m_wb_u32): Delete.
	(vddupq_n_u8): Delete.
	(vddupq_n_u32): Delete.
	(vddupq_n_u16): Delete.
	(vddupq_wb_u8): Delete.
	(vddupq_wb_u16): Delete.
	(vddupq_wb_u32): Delete.
	(vidupq_m_n_u8): Delete.
	(vidupq_m_n_u32): Delete.
	(vidupq_m_n_u16): Delete.
	(vidupq_m_wb_u8): Delete.
	(vidupq_m_wb_u16): Delete.
	(vidupq_m_wb_u32): Delete.
	(vidupq_n_u8): Delete.
	(vidupq_n_u32): Delete.
	(vidupq_n_u16): Delete.
	(vidupq_wb_u8): Delete.
	(vidupq_wb_u16): Delete.
	(vidupq_wb_u32): Delete.
	(vddupq_x_n_u8): Delete.
	(vddupq_x_n_u16): Delete.
	(vddupq_x_n_u32): Delete.
	(vddupq_x_wb_u8): Delete.
	(vddupq_x_wb_u16): Delete.
	(vddupq_x_wb_u32): Delete.
	(vidupq_x_n_u8): Delete.
	(vidupq_x_n_u16): Delete.
	(vidupq_x_n_u32): Delete.
	(vidupq_x_wb_u8): Delete.
	(vidupq_x_wb_u16): Delete.
	(vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m_n_u8): Delete.
	(__arm_vddupq_m_n_u32): Delete.
	(__arm_vddupq_m_n_u16): Delete.
	(__arm_vddupq_m_wb_u8): Delete.
	(__arm_vddupq_m_wb_u16): Delete.
	(__arm_vddupq_m_wb_u32): Delete.
	(__arm_vddupq_n_u8): Delete.
	(__arm_vddupq_n_u32): Delete.
	(__arm_vddupq_n_u16): Delete.
	(__arm_vidupq_m_n_u8): Delete.
	(__arm_vidupq_m_n_u32): Delete.
	(__arm_vidupq_m_n_u16): Delete.
	(__arm_vidupq_n_u8): Delete.
	(__arm_vidupq_m_wb_u8): Delete.
	(__arm_vidupq_m_wb_u16): Delete.
	(__arm_vidupq_m_wb_u32): Delete.
	(__arm_vidupq_n_u32): Delete.
	(__arm_vidupq_n_u16): Delete.
	(__arm_vidupq_wb_u8): Delete.
	(__arm_vidupq_wb_u16): Delete.
	(__arm_vidupq_wb_u32): Delete.
	(__arm_vddupq_wb_u8): Delete.
	(__arm_vddupq_wb_u16): Delete.
	(__arm_vddupq_wb_u32): Delete.
	(__arm_vddupq_x_n_u8): Delete.
	(__arm_vddupq_x_n_u16): Delete.
	(__arm_vddupq_x_n_u32): Delete.
	(__arm_vddupq_x_wb_u8): Delete.
	(__arm_vddupq_x_wb_u16): Delete.
	(__arm_vddupq_x_wb_u32): Delete.
	(__arm_vidupq_x_n_u8): Delete.
	(__arm_vidupq_x_n_u16): Delete.
	(__arm_vidupq_x_n_u32): Delete.
	(__arm_vidupq_x_wb_u8): Delete.
	(__arm_vidupq_x_wb_u16): Delete.
	(__arm_vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m): Delete.
	(__arm_vddupq_u8): Delete.
	(__arm_vddupq_u32): Delete.
	(__arm_vddupq_u16): Delete.
	(__arm_vidupq_m): Delete.
	(__arm_vidupq_u8): Delete.
	(__arm_vidupq_u32): Delete.
	(__arm_vidupq_u16): Delete.
	(__arm_vddupq_x_u8): Delete.
	(__arm_vddupq_x_u16): Delete.
	(__arm_vddupq_x_u32): Delete.
	(__arm_vidupq_x_u8): Delete.
	(__arm_vidupq_x_u16): Delete.
	(__arm_vidupq_x_u32): Delete.
cooljeanius pushed a commit that referenced this pull request Oct 24, 2024
gcc.dg/torture/pr112305.c contains an inner loop that executes
0x8000_0014 times and an outer loop that executes 5 times, giving about
10 billion total executions of the inner loop body.  At -O2 and above we
are able to remove the inner loop, but at -O1 we keep a no-op loop:

        dls     lr, r3
.L3:
        subs    r3, r3, #1
        le      lr, .L3

and at -O0 we of course don't optimise.

This can lead to long execution times on simulators, possibly
triggering a timeout.

gcc/testsuite
	* gcc.dg/torture/pr112305.c: Skip at -O0 and -O1 for simulators.
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
vec.h has this method:

  template<typename T, typename A>
  inline T *
  vec_safe_push (vec<T, A, vl_embed> *&v, const T &obj CXX_MEM_STAT_INFO)

where v is a reference to a pointer to vec.  This matches the regex for
VecPrinter, so gdbhooks.py attempts to print it but chokes on the reference.
I see the following:

  #1  0x0000000002b84b7b in vec_safe_push<edge_def*, va_gc> (v=Traceback (most
  recent call last):
    File "$SRC/gcc/gcc/gdbhooks.py", line 486, in to_string
      return '0x%x' % intptr(self.gdbval)
    File "$SRC/gcc/gcc/gdbhooks.py", line 168, in intptr
      return long(gdbval) if sys.version_info.major == 2 else int(gdbval)
  gdb.error: Cannot convert value to long.

This patch makes VecPrinter handle such references by stripping them
(dereferencing) at the top of the relevant functions.

gcc/ChangeLog:

	* gdbhooks.py (strip_ref): New. Use it ...
	(VecPrinter.to_string): ... here,
	(VecPrinter.children): ... and here.
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
Brief:
The bug appears in LRA after the rematerialization pass, while creating live ranges.
File lra.cc:
*************************************************************
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* We need full live info -- see the comment above.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
*************************************************************
The call `lra_create_live_ranges (lra_reg_spill_p, true)' is wrong.
It has to be `lra_create_live_ranges (true, true)'.

The explanation:
**********************************
int main (void)
{
  if (a.u33 * a.u33 != 0)
------^^^^^^^^^^^^^
    goto abrt;
  if (a.u33 * a.u40 * a.u33 != 0)
**********************************
The bug appears here.

Part of the expression `a.u33 * a.u33'
Before LRA:
*************************************************************
(insn 13 11 15 2 (set (reg:QI 184 [ _1+3 ])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 3 [0x3]))) [1 a+3 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 15 13 16 2 (set (reg:QI 64 [ a+4 ])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 4 [0x4]))) [1 a+4 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 16 15 20 2 (set (reg:QI 185 [ _1+4 ])
        (zero_extract:QI (reg:QI 64 [ a+4 ])
            (const_int 1 [0x1])
            (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
     (nil))
*************************************************************

After LRA:
*************************************************************
(insn 587 11 13 2 (set (reg:QI 24 r24 [368])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 3 [0x3]))) [1 a+3 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 13 587 15 2 (set (mem/c:QI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [4 %sfp+1 S1 A8])
        (reg:QI 24 r24 [368])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 15 13 16 2 (set (reg:QI 6 r6 [orig:64 a+4 ] [64])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 4 [0x4]))) [1 a+4 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 16 15 572 2 (set (reg:QI 24 r24 [orig:185 _1+4 ] [185])
        (zero_extract:QI (reg:QI 6 r6 [orig:64 a+4 ] [64])
            (const_int 1 [0x1])
            (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
     (nil))
(insn 572 16 20 2 (set (mem/c:QI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [4 %sfp+1 S1 A8])
        (reg:QI 24 r24 [orig:185 _1+4 ] [185])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
*************************************************************
Insn 13 and insn 572 use sfp+1 as a spill slot, but in the IRA pass these were
two different pseudos, r184 and r185.
Insn 13 uses sfp+1 as a spill slot for r184;
insn 572 uses the same slot for r185.  That's wrong.

Here we have a rematerialization.

Fragment from bf.c.317r.reload:
**************************************************************************************
******** Rematerialization #1: ********

df_worklist_dataflow_doublequeue: n_basic_blocks 14 n_edges 18 count 14 (    1)
df_worklist_dataflow_doublequeue: n_basic_blocks 14 n_edges 18 count 14 (    1)

Cands:
0 (nop=0, remat_regno=185, reload_regno=359):
(insn 16 15 572 2 (set (reg:QI 359 [orig:185 _1+4 ] [185])
                    (zero_extract:QI (reg:QI 64 [ a+4 ])
                        (const_int 1 [0x1])
                        (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
                 (nil))

**************************************************************************************
[...]
**************************************************************************************
Ranges after the compression:
 r185: [0..1]
	   Frame pointer can not be eliminated anymore
	   Spilling non-eliminable hard regs: 28 29
	 Spilling r113(28)
	 Spilling r184(29)
	 Spilling r208(29)
	 Spilling r209(28)
  Slot 0 regnos (width = 0):	 185	 209	 208	 184	 113
**************************************************************************************

The bug is here: `r185: [0..1]' is a wrong live range after compression.
r185 and r184 must not share the same spill slot!

Rematerialization in bf.c.317r.reload looks like:
*************************************************************
   24: r14:QI=r185:QI
    Inserting rematerialization insn before:
  581: r14:QI=zero_extract(r64:QI,0x1,0)

deleting insn with uid = 24.
         Considering alt=0 of insn 16:   (0) =r  (1) rYil  (2) n
          overall=0,losers=0,rld_nregs=0
   32: r22:QI=r185:QI
    Inserting rematerialization insn before:
  582: r22:QI=zero_extract(r64:QI,0x1,0)

deleting insn with uid = 32.
*************************************************************

This happens because of the following code.

Fragment from lra.cc (lra):
*************************************************************************
      if (! live_p)
	{
	  /* We need full live info for spilling pseudos into
	     registers instead of memory.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
	  live_p = true;
	}
      /* We should check necessity for spilling here as the above live
	 range pass can remove spilled pseudos.  */
      if (! lra_need_for_spills_p ())
	break;
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* We need full live info -- see the comment above.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
----------------------------------^^^^^^^^^^^^^^^
	  live_p = true;
*************************************************************************

The bug is here.
Rematerialization can sometimes act like spilling pseudos into registers:
  582: r22:QI=zero_extract(r64:QI,0x1,0)

So here we need live ranges for all pseudos.

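The fix, therefore, is to request full live info in that path.  A sketch of
the corrected fragment (the comment wording is mine, not the committed hunk):

*************************************************************************
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* Rematerialization can act like spilling pseudos into
	     registers, so we need full live info here too.  */
	  lra_create_live_ranges (true, true);
	  live_p = true;
*************************************************************************
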
PS: the patch will not affect any target with a usable definition of
    the TARGET_SPILL_CLASS hook.

	PR target/116778
gcc/
	* lra-lives.cc (complete_info_p): Clarify the comment.
	* lra.cc (lra): Create a full live info after rematerialization.
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
This PR reports a missed optimization.  When we have:

  Str str{"Test"};
  callback(str);

as in the test, we're able to evaluate the Str::Str() call at compile
time.  But when we have:

  callback(Str{"Test"});

we are not.  With this patch (in fact, it's Patrick's patch with a little
tweak), we turn

  callback (TARGET_EXPR <D.2890, <<< Unknown tree: aggr_init_expr
    5
    __ct_comp
    D.2890
    (struct Str *) <<< Unknown tree: void_cst >>>
    (const char *) "Test" >>>>)

into

  callback (TARGET_EXPR <D.2890, {.str=(const char *) "Test", .length=4}>)

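For concreteness, a hedged sketch of the shape of the test case (the member
names .str and .length come from the folded constructor above; the constexpr
constructor and the callback signature are assumptions, and the loop in the
constructor needs C++14):

  struct Str {
    const char *str = nullptr;
    int length = 0;
    constexpr Str (const char *s) : str (s)
    {
      while (s[length])	// yields length == 4 for "Test"
	++length;
    }
  };

  void callback (const Str &);

  void g ()
  {
    callback (Str{"Test"});	// prvalue argument, now foldable at compile time
  }
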
I explored the idea of calling maybe_constant_value for the whole
TARGET_EXPR in cp_fold.  That has three problems:
- we can't always elide a TARGET_EXPR, so we'd have to make sure the
  result is also a TARGET_EXPR;
- the resulting TARGET_EXPR must have the same flags, otherwise Bad
  Things happen;
- getting a new slot is also problematic.  I've seen a test where we
  had "TARGET_EXPR<D.2680, ...>, D.2680", and folding the whole TARGET_EXPR
  would get us "TARGET_EXPR<D.2681, ...>", but since we don't see the outer
  D.2680, we can't replace it with D.2681, and things break.

With this patch, two tree-ssa tests regressed: pr78687.C and pr90883.C.

FAIL: g++.dg/tree-ssa/pr90883.C   scan-tree-dump dse1 "Deleted redundant store: .*.a = {}"
is easy.  Previously, we would call C::C, so .gimple has:

  D.2590 = {};
  C::C (&D.2590);
  D.2597 = D.2590;
  return D.2597;

Then .einline inlines the C::C call:

  D.2590 = {};
  D.2590.a = {}; // #1
  D.2590.b = 0;  // #2
  D.2597 = D.2590;
  D.2590 ={v} {CLOBBER(eos)};
  return D.2597;

then #2 is removed in .fre1, and #1 is removed in .dse1.  So the test
passes.  But with the patch, .gimple won't have that C::C call, so the
IL is of course going to look different.  The .optimized dump looks the
same though so there's no problem.

pr78687.C is XFAILed because the test passes with r15-5746 but not with
r15-5747, so the failure is not specific to this patch.  I opened
<https://gcc.gnu.org/PR117971>.

	PR c++/116416

gcc/cp/ChangeLog:

	* cp-gimplify.cc (cp_fold_r) <case TARGET_EXPR>: Try to fold
	TARGET_EXPR_INITIAL and replace it with the folded result if
	it's TREE_CONSTANT.

gcc/testsuite/ChangeLog:

	* g++.dg/analyzer/pr97116.C: Adjust dg-message.
	* g++.dg/tree-ssa/pr78687.C: Add XFAIL.
	* g++.dg/tree-ssa/pr90883.C: Adjust dg-final.
	* g++.dg/cpp0x/constexpr-prvalue1.C: New test.
	* g++.dg/cpp1y/constexpr-prvalue1.C: New test.

Co-authored-by: Patrick Palka <[email protected]>
Reviewed-by: Jason Merrill <[email protected]>
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
With the changes in r15-1579-g792f97b44ff, the code used as "padding" in
the test case is optimized away.  Prevent this optimization by writing to
volatile memory in the padding macro.
Also, validate that there is a far jump in the generated assembler.

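A minimal sketch of the idea (g_0_1 is taken from the assembly below; the
macro body itself is an assumption, not the actual test source): each
expansion of the padding macro stores to a volatile object, so the compiler
cannot delete it and the function stays long enough to require a far jump.

volatile int g_0_1;
#define PAD() do { g_0_1 = 1; } while (0)  /* every use emits a volatile store */
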
Without this patch, the generated assembler is reduced to:
f3:
        cmp     r0, #0
        beq     .L1
        ldr     r4, .L6
.L1:
        bx      lr
.L7:
        .align  2
.L6:
        .word   g_0_1

With the patch, the generated assembler is:
f3:
        movs    r2, #1
        ldr     r3, .L6
        push    {lr}
        str     r2, [r3]
        cmp     r0, #0
        bne     .LCB10
        bl      .L1     @far jump
.LCB10:
        b       .L7
.L8:
        .align  2
.L6:
        .word   .LANCHOR0
.L7:
        str     r2, [r3]
        ...
        str     r2, [r3]
.L1:
        pop     {pc}

gcc/testsuite/ChangeLog:

	* gcc.target/arm/thumb1-far-jump-2.c: Write to volatile memory
	in the macro to avoid optimization.

Signed-off-by: Torbjörn SVENSSON <[email protected]>
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
On Cortex-M4, the code generated is:
     cmp     r0, r1
     itte    ne
     lslne   r0, r0, r1
     asrne   r0, r0, #1
     moveq   r0, r1
     add     r0, r0, r1
     bx      lr

On Cortex-M7, the code generated is:
     cmp     r0, r1
     beq     .L3
     lsls    r0, r0, r1
     asrs    r0, r0, #1
     add     r0, r0, r1
     bx      lr
.L3:
     mov     r0, r1
     add     r0, r0, r1
     bx      lr

As Cortex-M7 only allows at most one conditional instruction, force
Cortex-M4 tuning to keep the test case stable.

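A hedged illustration of the kind of header change implied by the ChangeLog
entry below; only -mtune=cortex-m4 is taken from the patch, the remaining
options are assumptions:

/* { dg-options "-O2 -mthumb -mtune=cortex-m4" } */
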
gcc/testsuite/ChangeLog:

	* gcc.target/arm/thumb-ifcvt.c: Use -mtune=cortex-m4.

Signed-off-by: Torbjörn SVENSSON <[email protected]>
cooljeanius pushed a commit that referenced this pull request Dec 22, 2024
This crash started with my r12-7803 but I believe the problem lies
elsewhere.

build_vec_init has cleanup_flags whose purpose is -- if I grok this
correctly -- to avoid destructing an object multiple times.  Let's
say we are initializing an array of A.  Then we might end up in
a scenario similar to initlist-eh1.C:

  try
    {
      call A::A in a loop
      // #0
      try
        {
	  call a fn using the array
	}
      finally
	{
	  // #1
	  call A::~A in a loop
	}
    }
  catch
    {
      // #2
      call A::~A in a loop
    }

cleanup_flags makes us emit a statement like

  D.3048 = 2;

at #0 to disable performing the cleanup at #2, since #1 will take
care of the destruction of the array.

But if we are not emitting the loop because we can use a constant
initializer (and use a single { a, b, ...}), we shouldn't generate
the statement resetting the iterator to its initial value.  Otherwise
we crash in gimplify_var_or_parm_decl because it gets the stray decl
D.3048.

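For concreteness, a hedged sketch of the situation described above (not the
actual new tests): an array whose element type has a destructor (so element
cleanups are in play) but whose elements all have constant initializers (so
no initialization loop is needed).

  struct A {
    int i;
    ~A () { }			// non-trivial destructor, cleanups needed
  };

  void use (A (&)[2]);

  void f ()
  {
    A arr[2] = { {1}, {2} };	// constant element initializers, no init loop
    use (arr);
  }
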
	PR c++/117985

gcc/cp/ChangeLog:

	* init.cc (build_vec_init): Pop CLEANUP_FLAGS if we're not
	generating the loop.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/initlist-array23.C: New test.
	* g++.dg/cpp0x/initlist-array24.C: New test.
cooljeanius pushed a commit that referenced this pull request Jan 14, 2025
This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
default. To that end, the function aarch64_use_new_vector_costs_p and its uses
were removed. To prevent costing vec_to_scalar operations with 0, as
described in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
we adjusted vectorizable_store such that the variable n_adjacent_stores
also covers vec_to_scalar operations. This way vec_to_scalar operations
are not costed individually, but as a group.
As suggested by Richard Sandiford, the "known_ne" in the multilane-check
was replaced by "maybe_ne" in order to treat nunits==1+1X as a vector
rather than a scalar.

Two tests were adjusted due to changes in codegen.  In both cases, the
old code unrolled the loop once, but the new code does not:
Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L92
        mov     x4, 0
        uxtw    x3, w3
+       cntd    x5
+       whilelo p7.d, xzr, x3
+       mov     z29.s, w5
        mov     z31.s, w2
-       whilelo p6.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p7.d, xzr, x2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L94:
-       ld1d    z27.d, p7/z, [x0, #1, mul vl]
-       ld1d    z28.d, p6/z, [x0]
-       movprfx z29, z31
-       mul     z29.s, p5/m, z29.s, z30.s
-       incw    x4
-       uunpklo z0.d, z29.s
-       uunpkhi z29.d, z29.s
-       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
-       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
-       add     z25.d, z28.d, z25.d
+       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
+       movprfx z28, z31
+       mul     z28.s, p6/m, z28.s, z30.s
+       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
        add     z26.d, z27.d, z26.d
-       st1d    z26.d, p7, [x0, #1, mul vl]
-       whilelo p7.d, x4, x2
-       st1d    z25.d, p6, [x0]
-       incw    z30.s
-       incb    x0, all, mul #2
-       whilelo p6.d, x4, x3
+       st1d    z26.d, p7, [x0, x4, lsl 3]
+       add     z30.s, z30.s, z29.s
+       incd    x4
+       whilelo p7.d, x4, x3
        b.any   .L94
 .L92:
        ret

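For orientation, the f_int64_t_32 loop plausibly has roughly this shape (a
reconstruction from the assembly, not the actual test source); the
strided_store_2.c example below is the mirrored form that scatter-stores to
dst[i * stride]:

#include <stdint.h>

void
f_int64_t_32 (int64_t *dst, int64_t *src, int32_t stride, int32_t n)
{
  for (int32_t i = 0; i < n; ++i)
    dst[i] += src[i * stride];
}
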
Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L84
-       addvl   x5, x1, #1
        mov     x4, 0
        uxtw    x3, w3
-       mov     z31.s, w2
+       cntd    x5
        whilelo p7.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p6.d, xzr, x2
+       mov     z29.s, w5
+       mov     z31.s, w2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L86:
-       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
-       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
-       movprfx z29, z30
-       mul     z29.s, p5/m, z29.s, z31.s
-       add     z28.d, z28.d, #1
-       uunpklo z26.d, z29.s
-       st1d    z28.d, p7, [x0, z26.d, lsl 3]
-       incw    x4
-       uunpkhi z29.d, z29.s
+       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
+       movprfx z28, z30
+       mul     z28.s, p6/m, z28.s, z31.s
        add     z27.d, z27.d, #1
-       whilelo p6.d, x4, x2
-       st1d    z27.d, p7, [x0, z29.d, lsl 3]
-       incw    z30.s
+       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
+       incd    x4
+       add     z30.s, z30.s, z29.s
        whilelo p7.d, x4, x3
        b.any   .L86
 .L84:
	ret

The patch was bootstrapped and tested on aarch64-linux-gnu, no
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <[email protected]>

gcc/
	* tree-vect-stmts.cc (vectorizable_store): Extend the use of
	n_adjacent_stores to also cover vec_to_scalar operations.
	* config/aarch64/aarch64-tuning-flags.def: Remove
	use_new_vector_costs as tuning option.
	* config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
	Remove.
	(aarch64_vector_costs::add_stmt_cost): Remove use of
	aarch64_use_new_vector_costs_p.
	(aarch64_vector_costs::finish_cost): Remove use of
	aarch64_use_new_vector_costs_p.
	* config/aarch64/tuning_models/cortexx925.h: Remove
	AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
	* config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
	* config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
	* config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
	* config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
	* config/aarch64/tuning_models/neoversen2.h: Likewise.
	* config/aarch64/tuning_models/neoversen3.h: Likewise.
	* config/aarch64/tuning_models/neoversev1.h: Likewise.
	* config/aarch64/tuning_models/neoversev2.h: Likewise.
	* config/aarch64/tuning_models/neoversev3.h: Likewise.
	* config/aarch64/tuning_models/neoversev3ae.h: Likewise.

gcc/testsuite/
	* gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
	* gcc.target/aarch64/sve/strided_store_2.c: Likewise.