[BPF] Do atomic_fetch_*() pattern matching with memory ordering #107343

Merged
3 commits merged on Sep 24, 2024

Conversation

yonghong-song
Contributor

@yonghong-song yonghong-song commented Sep 5, 2024

Three commits in this pull request:
commit 1: implement pattern matching for the memory orderings seq_cst, acq_rel, release, acquire and monotonic. Specifically, for monotonic memory ordering (the relaxed memory model), if the return value is not used, a locked insn is used.
commit 2: add support for handling the dwarf atomic modifier in BTF generation. The atomic modifier is simply ignored in BTF.
commit 3: add tests for the new atomic ordering support and for BTF support with the _Atomic type.
I removed the RFC tag as the patch set is now in a reasonable state.

For atomic fetch_and_*() operations, do pattern matching with the memory orderings
seq_cst, acq_rel, release, acquire and monotonic (relaxed). For fetch_and_*()
operations with seq_cst/acq_rel/release/acquire ordering, atomic_fetch_*()
instructions are generated. For monotonic ordering, locked insns are generated
if the return value is not used; otherwise, atomic_fetch_*() insns are used.
The main motivation is to resolve the kernel issue [1].

The following memory orderings are supported:
seq_cst, acq_rel, release, acquire, relaxed
The current gcc-style __sync_fetch_and_*() operations are all seq_cst.
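For comparison, here is a minimal sketch (not part of this patch) showing that the gcc-style builtin carries no ordering argument while the __atomic form makes the same seq_cst ordering explicit:

/* Sketch only: both calls request sequentially consistent ordering. */
void bump(long *p) {
   (void)__sync_fetch_and_add(p, 1);
   (void)__atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
}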

To use explicit memory ordering, the _Atomic type is needed. The following is
an example:

$ cat test.c
#include <stdatomic.h>
void f1(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
void f2(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
void f3(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test.c -o test.o && llvm-objdump -d test.o
$ ./run.sh
       
test.o: file format elf64-bpf
       
Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       1:       c3 21 00 00 50 00 00 00 lock *(u32 *)(r1 + 0x0) &= w2
       2:       95 00 00 00 00 00 00 00 exit
       
0000000000000018 <f2>:
       3:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       4:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       5:       95 00 00 00 00 00 00 00 exit
       
0000000000000030 <f3>:
       6:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       7:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       8:       95 00 00 00 00 00 00 00 exit

The following is another example where the return value is used:

$ cat test1.c
#include <stdatomic.h>
int f1(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}  
int f2(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_acquire);
}  
int f3(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}  
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test1.c -o test1.o && llvm-objdump -d test1.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       1:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       2:       95 00 00 00 00 00 00 00 exit
       
0000000000000018 <f2>:
       3:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       4:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       5:       95 00 00 00 00 00 00 00 exit
       
0000000000000030 <f3>:
       6:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       7:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       8:       95 00 00 00 00 00 00 00 exit

You can see that for relaxed memory ordering, if the return value is used, an atomic_fetch_and()
insn is used; otherwise, if the return value is not used, a locked insn is used.
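For reference, the same pair of cases can also be written with the standard <stdatomic.h> generic functions rather than the __c11 builtins (a sketch, not taken from this patch's tests):

#include <stdatomic.h>

_Atomic int flags;

/* Result unused: with relaxed ordering this can lower to a locked insn. */
void clear_bit(void) {
   atomic_fetch_and_explicit(&flags, ~0x4, memory_order_relaxed);
}

/* Result used: an atomic_fetch_and() insn is needed regardless of ordering. */
int clear_bit_old(void) {
   return atomic_fetch_and_explicit(&flags, ~0x4, memory_order_relaxed);
}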

Here is another example with a global _Atomic variable:

$ cat test3.c
#include <stdatomic.h>

_Atomic int i;

void f1(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_relaxed);
}
void f2(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test3.c -o test3.o && llvm-objdump -d test3.o
$ ./run.sh

test3.o:        file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       1:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       3:       c3 12 00 00 50 00 00 00 lock *(u32 *)(r2 + 0x0) &= w1
       4:       95 00 00 00 00 00 00 00 exit
       
0000000000000028 <f2>:
       5:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       6:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       8:       c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
       9:       95 00 00 00 00 00 00 00 exit

Note that '-g' is not used in the above compilations. The reason is the following IR
related to the _Atomic type:

$ clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -g -S -emit-llvm test3.c

The related debug info for test3.c:

!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(name: "i", scope: !2, file: !3, line: 3, type: !16, isLocal: false, isDefinition: true)
...
!16 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !17)
!17 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)

When compiling test.c, the related debug info is:

...
!19 = distinct !DISubprogram(name: "f1", scope: !1, file: !1, line: 3, type: !20, scopeLine: 3, flags: DIFlagPrototyped | DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !25)
!20 = !DISubroutineType(types: !21)
!21 = !{null, !22}
!22 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !23, size: 64)
!23 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !24)
!24 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
!25 = !{!26}
!26 = !DILocalVariable(name: "i", arg: 1, scope: !19, file: !1, line: 3, type: !22)

All of the above suggests that _Atomic behaves like a modifier (e.g. const, restrict, volatile).
This seems true based on the DWARF doc [2].
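As a quick illustration (a sketch, not from the patch), _Atomic composes with a base type the same way the other qualifiers do:

/* Both declarations name an atomic int; _Atomic acts as a type qualifier. */
_Atomic int a;
int _Atomic b;

/* It also nests under derived types, e.g. a pointer to an atomic int. */
_Atomic int *p = &a;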

Without properly handling DW_TAG_atomic_type, llvm BTF generation will be incorrect since
the current implementation assumes DW_TAG_atomic_type does not exist. So we have
two choices here:
(1) The llvm bpf backend processes DW_TAG_atomic_type but ignores it in the BTF encoding.
(2) Add another type, e.g. BTF_KIND_ATOMIC, to BTF. BTF_KIND_ATOMIC would behave as a
modifier like const/volatile/restrict.

For choice (1), the llvm bpf backend should skip dwarf::DW_TAG_atomic_type during
BTF generation wherever necessary.

For choice (2), BTF_KIND_ATOMIC would be added to BTF, so both the llvm backend and the kernel
would need to handle it properly. Its main advantage is probably that the atomic type is
preserved and therefore also available to the skeleton. But I think for the skeleton a raw
type is good enough unless user space intends to do some atomic operation with it, which is
an unlikely case.

So I chose choice (1) in this implementation. See the commit message of the second commit for details.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://dwarfstd.org/issues/131112.1.html

@yonghong-song
Contributor Author

cc @anakryiko @jemarch

@eddyz87
Contributor

eddyz87 commented Sep 5, 2024

So, basically we want to generate lock *(u64 *)(rX + ...) += rY for __c11_atomic_fetch_and(..., memory_order_relaxed) and rX = atomic_fetch_add(...) for everything else.
memory_order_relaxed corresponds to monotonic LLVM IR ordering.
And based on the referenced mailing list discussion, this is done to allow the ARM jit to generate the LDADD instruction (w/o L, A, or AL suffixes) for this C interface (which it currently does for lock ... instructions).

I was unable to figure out from the ARM documentation whether LDADD is monotonic or unordered. However, the test below shows that at least we are on the same page as the LLVM ARM backend:

$ cat test2.c
void f1(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
void f2(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_CONSUME); }
void f3(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_ACQUIRE); }
void f4(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_RELEASE); }
void f5(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_ACQ_REL); }
void f6(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_SEQ_CST); }
$ clang --target=aarch64 -march=armv8.1-a -O2 test2.c -c -o - | llvm-objdump -Sdr -
<stdin>:	file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <f1>:
       0: 52800148     	mov	w8, #0xa                // =10
       4: f8280008     	ldadd	x8, x8, [x0]
       8: d65f03c0     	ret

000000000000000c <f2>:
       c: 52800148     	mov	w8, #0xa                // =10
      10: f8a80008     	ldadda	x8, x8, [x0]
      14: d65f03c0     	ret
0000000000000018 <f3>:
      18: 52800148     	mov	w8, #0xa                // =10
      1c: f8a80008     	ldadda	x8, x8, [x0]
      20: d65f03c0     	ret
0000000000000024 <f4>:
      24: 52800148     	mov	w8, #0xa                // =10
      28: f8680008     	ldaddl	x8, x8, [x0]
      2c: d65f03c0     	ret
0000000000000030 <f5>:
      30: 52800148     	mov	w8, #0xa                // =10
      34: f8e80008     	ldaddal	x8, x8, [x0]
      38: d65f03c0     	ret
000000000000003c <f6>:
      3c: 52800148     	mov	w8, #0xa                // =10
      40: f8e80008     	ldaddal	x8, x8, [x0]
      44: d65f03c0     	ret

So, I think we are good.

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?
LLVM documentation allows this.
Edit: currently clang crashes with a backtrace if one of these is used; it is probably better to report an error in a more user-friendly way.

@eddyz87
Contributor

eddyz87 commented Sep 6, 2024

Also note:

$ cat test2.c
long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
$ clang --target=bpf -mcpu=v3 -O2 test2.c -c -o - | llvm-objdump -Sdr -
test2.c:1:6: error: Invalid usage of the XADD return value
    1 | long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
      |      ^
1 error generated.

<stdin>:	file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:	b7 00 00 00 0a 00 00 00	r0 = 0xa
       1:	db 01 00 00 00 00 00 00	lock *(u64 *)(r1 + 0x0) += r0
       2:	95 00 00 00 00 00 00 00	exit

Which seems incorrect.

@yonghong-song
Contributor Author

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?
LLVM documentation allows this.
Edit: currently clang crashes with a backtrace if one of these is used; it is probably better to report an error in a more user-friendly way.

Agree that issuing an error message is more user-friendly. Also, __ATOMIC_CONSUME is a variant of __ATOMIC_ACQUIRE, but it looks like they may have a subtle difference. From https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/_005f_005fatomic-Builtins.html:

__ATOMIC_CONSUME
    Data dependency only for both barrier and synchronization with another thread.
__ATOMIC_ACQUIRE
    Barrier to hoisting of code and synchronizes with release (or stronger) semantic stores from another thread. 

The Linux kernel does not support __ATOMIC_CONSUME, so the BPF backend may not support it anytime soon.
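For completeness, a small sketch (not from the patch) of how the two orderings are requested with the __atomic builtins; compilers currently strengthen consume to acquire when lowering:

/* Same builtin, different requested ordering. */
void take(long *p, long v) {
   (void)__atomic_fetch_add(p, v, __ATOMIC_CONSUME);
   (void)__atomic_fetch_add(p, v, __ATOMIC_ACQUIRE);
}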

@yonghong-song
Contributor Author

Also note:

$ cat test2.c
long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
$ clang --target=bpf -mcpu=v3 -O2 test2.c -c -o - | llvm-objdump -Sdr -
test2.c:1:6: error: Invalid usage of the XADD return value
    1 | long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
      |      ^
1 error generated.

<stdin>:	file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:	b7 00 00 00 0a 00 00 00	r0 = 0xa
       1:	db 01 00 00 00 00 00 00	lock *(u64 *)(r1 + 0x0) += r0
       2:	95 00 00 00 00 00 00 00	exit

Which seems incorrect.

You are right. Yes, if the function's return value is used, we should just use atomic_fetch_*() insns instead of locked insns.

@anakryiko

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?

Why not use the strongest ordering for all those "intermediate" orderings instead of erroring out? Just use __ATOMIC_SEQ_CST? If, in the future, BPF has weaker orderings, then we can always downgrade to that (probably with cpu version bump).

@eddyz87
Contributor

eddyz87 commented Sep 6, 2024

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?

Why not use the strongest ordering for all those "intermediate" orderings instead of erroring out? Just use __ATOMIC_SEQ_CST? If, in the future, BPF has weaker orderings, then we can always downgrade to that (probably with cpu version bump).

I assume that there would be some reasoning behind users asking for a specific memory order (e.g. hoping for better performance), and thus it's better to let the user know which orderings are supported.

But falling back to __ATOMIC_SEQ_CST is an option, yes.

@anakryiko

anakryiko commented Sep 6, 2024

Not all orderings are meaningful on all architectures, but compilers don't error out on that, no? Think about writers of some common helpers dealing with atomic primitives. Why should they worry about which specific mode is supported (especially since that might change over time depending on kernel version)? They request the minimum level of ordering, but they would work equally well with stricter ordering if the underlying architecture (verifier/JIT in this case) doesn't support some of them on some older versions.

More broadly speaking (non-BPF specific), if I know I'm ok with relaxed ordering, but some architecture only supports SEQ_CST, I'm fine with that as a writer of atomic code. It's a much bigger PITA to work around all these version/arch-specific issues if they become compile-time or runtime errors, IMO.

@eddyz87
Contributor

eddyz87 commented Sep 6, 2024

More broadly speaking (non-BPF specific), if I know I'm ok with relaxed ordering, but some architecture only supports SEQ_CST, I'm fine with that as a writer of atomic code. It's a much bigger PITA to work around all these version/arch-specific issues if they become compile-time or runtime errors, IMO.

Fair enough.
@yonghong-song , what do you think?

@yonghong-song
Contributor Author

@anakryiko @eddyz87 Yes, we can initially implement the other memory orderings (acquire, release, acq_rel) to be the same as seq_cst. Later on, we can invent new instructions in llvm to actually support acquire, release and acq_rel properly. It is very likely that by the time we are able to land the llvm/kernel changes, we will already have figured out the new insn format and we will just implement them at that time.

@yonghong-song
Contributor Author

Just made the change to support seq_cst/acq_rel/acquire/release/monotonic memory ordering. Also fixed the issue where a locked insn was used even though the source expects a return value. I will continue to work on the llvm BPF backend to process DW_TAG_atomic_type as the next step.
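For BPF program authors, note that the clang part of this patch also defines a __BPF_FEATURE_ATOMIC_MEM_ORDERING macro (see the BPF.cpp hunk below), so sources can detect the new lowering; a sketch, with the fallback branch purely illustrative:

#include <stdatomic.h>

void clear_flag(_Atomic int *p) {
#ifdef __BPF_FEATURE_ATOMIC_MEM_ORDERING
   /* Ordering-aware compiler: relaxed fetch-and with an unused result
    * can lower to a locked insn. */
   (void)__c11_atomic_fetch_and(p, ~0x1, memory_order_relaxed);
#else
   /* Older compilers: fall back to the strongest ordering. */
   (void)__c11_atomic_fetch_and(p, ~0x1, memory_order_seq_cst);
#endif
}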

@llvmbot added the labels clang (Clang issues not falling into any other category) and clang:frontend (Language frontend issues, e.g. anything involving "Sema") on Sep 9, 2024
@llvmbot
Collaborator

llvmbot commented Sep 9, 2024

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-debuginfo

@llvm/pr-subscribers-clang

Author: None (yonghong-song)

Changes

Without properly handling DW_TAG_atomic_type, llvm BTF generation will be incorrect since
the current implementation assumes DW_TAG_atomic_type does not exist. So we have
two choices here:
(1) The llvm bpf backend processes DW_TAG_atomic_type but ignores it in the BTF encoding.
(2) Add another type, e.g. BTF_KIND_ATOMIC, to BTF. BTF_KIND_ATOMIC would behave as a
modifier like const/volatile/restrict.

For choice (1), the following is a hack which can make '-g' work for test1.c:

diff --git a/llvm/lib/Target/BPF/BTFDebug.cpp b/llvm/lib/Target/BPF/BTFDebug.cpp
index 4d847abea731..fd61bb811111 100644
--- a/llvm/lib/Target/BPF/BTFDebug.cpp
+++ b/llvm/lib/Target/BPF/BTFDebug.cpp
@@ -1444,8 +1444,14 @@ void BTFDebug::processGlobals(bool ProcessingMapDef) {
       DIGlobal = GVE->getVariable();
       if (SecName.starts_with(".maps"))
         visitMapDefType(DIGlobal->getType(), GVTypeId);
-      else
-        visitTypeEntry(DIGlobal->getType(), GVTypeId, false, false);
+      else {
+        const DIType *Ty = DIGlobal->getType();
+        auto *DTy = dyn_cast<DIDerivedType>(Ty);
+        if (DTy && DTy->getTag() == dwarf::DW_TAG_atomic_type)
+          visitTypeEntry(DTy->getBaseType(), GVTypeId, false, false);
+        else
+          visitTypeEntry(Ty, GVTypeId, false, false);
+      }
       break;
     } 

You can see that basically dwarf::DW_TAG_atomic_type is skipped during BTF generation.
Other changes are needed to handle the remaining usages of dwarf::DW_TAG_atomic_type.

For choice (2), BTF_KIND_ATOMIC would be added to BTF, so both the llvm backend and the kernel
would need to handle it properly. Its main advantage is probably that the atomic type is preserved
and therefore also available to the skeleton. But I think for the skeleton a raw type is good
enough unless user space intends to do some atomic operation with it. Do we really
have such cases?

[1] https://lore.kernel.org/bpf/7b941f53-2a05-48ec-9032-8f106face3a3@linux.dev/
[2] https://dwarfstd.org/issues/131112.1.html


Full diff: https://github.com/llvm/llvm-project/pull/107343.diff

4 Files Affected:

  • (modified) clang/lib/Basic/Targets/BPF.cpp (+1)
  • (modified) llvm/lib/Target/BPF/BPFInstrInfo.td (+103-10)
  • (modified) llvm/lib/Target/BPF/BPFMIChecking.cpp (+80-11)
  • (modified) llvm/lib/Target/BPF/BTFDebug.cpp (+19-2)
diff --git a/clang/lib/Basic/Targets/BPF.cpp b/clang/lib/Basic/Targets/BPF.cpp
index a94ceee5a6a5e7..77e3a9388b0c46 100644
--- a/clang/lib/Basic/Targets/BPF.cpp
+++ b/clang/lib/Basic/Targets/BPF.cpp
@@ -37,6 +37,7 @@ void BPFTargetInfo::getTargetDefines(const LangOptions &Opts,
   }
 
   Builder.defineMacro("__BPF_FEATURE_ADDR_SPACE_CAST");
+  Builder.defineMacro("__BPF_FEATURE_ATOMIC_MEM_ORDERING");
 
   if (CPU.empty())
     CPU = "v3";
diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
index f7e17901c7ed5e..68b0d1b70efe20 100644
--- a/llvm/lib/Target/BPF/BPFInstrInfo.td
+++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
@@ -864,26 +864,119 @@ class XFALU32<BPFWidthModifer SizeOp, BPFArithOp Opc, string OpcodeStr,
 
 let Constraints = "$dst = $val" in {
   let Predicates = [BPFHasALU32], DecoderNamespace = "BPFALU32" in {
-    def XFADDW32 : XFALU32<BPF_W, BPF_ADD, "u32", "add", atomic_load_add_i32>;
-    def XFANDW32 : XFALU32<BPF_W, BPF_AND, "u32", "and", atomic_load_and_i32>;
-    def XFORW32  : XFALU32<BPF_W, BPF_OR,  "u32", "or",  atomic_load_or_i32>;
-    def XFXORW32 : XFALU32<BPF_W, BPF_XOR, "u32", "xor", atomic_load_xor_i32>;
+    def XFADDW32 : XFALU32<BPF_W, BPF_ADD, "u32", "add", atomic_load_add_i32_seq_cst>;
+    def XFANDW32 : XFALU32<BPF_W, BPF_AND, "u32", "and", atomic_load_and_i32_seq_cst>;
+    def XFORW32  : XFALU32<BPF_W, BPF_OR,  "u32", "or",  atomic_load_or_i32_seq_cst>;
+    def XFXORW32 : XFALU32<BPF_W, BPF_XOR, "u32", "xor", atomic_load_xor_i32_seq_cst>;
   }
 
   let Predicates = [BPFHasALU32] in {
-    def XFADDD : XFALU64<BPF_DW, BPF_ADD, "u64", "add", atomic_load_add_i64>;
+    def XFADDD : XFALU64<BPF_DW, BPF_ADD, "u64", "add", atomic_load_add_i64_seq_cst>;
   }
-  def XFANDD : XFALU64<BPF_DW, BPF_AND, "u64", "and", atomic_load_and_i64>;
-  def XFORD  : XFALU64<BPF_DW, BPF_OR,  "u64", "or",  atomic_load_or_i64>;
-  def XFXORD : XFALU64<BPF_DW, BPF_XOR, "u64", "xor", atomic_load_xor_i64>;
+  def XFANDD : XFALU64<BPF_DW, BPF_AND, "u64", "and", atomic_load_and_i64_seq_cst>;
+  def XFORD  : XFALU64<BPF_DW, BPF_OR,  "u64", "or",  atomic_load_or_i64_seq_cst>;
+  def XFXORD : XFALU64<BPF_DW, BPF_XOR, "u64", "xor", atomic_load_xor_i64_seq_cst>;
+}
+
+let Predicates = [BPFHasALU32] in {
+    def : Pat<(atomic_load_add_i32_monotonic ADDRri:$addr, GPR32:$val),
+              (XADDW32 ADDRri:$addr, GPR32:$val)>;
+    def : Pat<(atomic_load_add_i32_acquire ADDRri:$addr, GPR32:$val),
+              (XFADDW32 ADDRri:$addr, GPR32:$val)>;
+    def : Pat<(atomic_load_add_i32_release ADDRri:$addr, GPR32:$val),
+              (XFADDW32 ADDRri:$addr, GPR32:$val)>;
+    def : Pat<(atomic_load_add_i32_acq_rel ADDRri:$addr, GPR32:$val),
+              (XFADDW32 ADDRri:$addr, GPR32:$val)>;
+
+    def : Pat<(atomic_load_add_i64_monotonic ADDRri:$addr, GPR:$val),
+              (XADDD ADDRri:$addr, GPR:$val)>;
+    def : Pat<(atomic_load_add_i64_acquire ADDRri:$addr, GPR:$val),
+              (XFADDD ADDRri:$addr, GPR:$val)>;
+    def : Pat<(atomic_load_add_i64_release ADDRri:$addr, GPR:$val),
+              (XFADDD ADDRri:$addr, GPR:$val)>;
+    def : Pat<(atomic_load_add_i64_acq_rel ADDRri:$addr, GPR:$val),
+              (XFADDD ADDRri:$addr, GPR:$val)>;
 }
 
 // atomic_load_sub can be represented as a neg followed
 // by an atomic_load_add.
-def : Pat<(atomic_load_sub_i32 ADDRri:$addr, GPR32:$val),
+// FIXME: the below can probably be simplified.
+def : Pat<(atomic_load_sub_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_release ADDRri:$addr, GPR32:$val),
+          (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_seq_cst ADDRri:$addr, GPR32:$val),
           (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
-def : Pat<(atomic_load_sub_i64 ADDRri:$addr, GPR:$val),
+
+def : Pat<(atomic_load_sub_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_release ADDRri:$addr, GPR:$val),
+          (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_acq_rel ADDRri:$addr, GPR:$val),
           (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_seq_cst ADDRri:$addr, GPR:$val),
+          (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+
+def : Pat<(atomic_load_and_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XANDW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_and_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFANDW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_and_i32_release ADDRri:$addr, GPR32:$val),
+          (XFANDW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_and_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFANDW32 ADDRri:$addr, GPR32:$val)>;
+
+
+def : Pat<(atomic_load_and_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XANDD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_and_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFANDD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_and_i64_release ADDRri:$addr, GPR:$val),
+          (XFANDD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_and_i64_acq_rel ADDRri:$addr, GPR:$val),
+          (XFANDD ADDRri:$addr, GPR:$val)>;
+
+def : Pat<(atomic_load_or_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_or_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_or_i32_release ADDRri:$addr, GPR32:$val),
+          (XFORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_or_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFORW32 ADDRri:$addr, GPR32:$val)>;
+
+def : Pat<(atomic_load_or_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_or_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_or_i64_release ADDRri:$addr, GPR:$val),
+          (XFORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_or_i64_acq_rel ADDRri:$addr, GPR:$val),
+          (XFORD ADDRri:$addr, GPR:$val)>;
+
+def : Pat<(atomic_load_xor_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XXORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_xor_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFXORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_xor_i32_release ADDRri:$addr, GPR32:$val),
+          (XFXORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_xor_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFXORW32 ADDRri:$addr, GPR32:$val)>;
+
+def : Pat<(atomic_load_xor_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XXORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_xor_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFXORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_xor_i64_release ADDRri:$addr, GPR:$val),
+          (XFXORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_xor_i64_acq_rel ADDRri:$addr, GPR:$val),
+          (XFXORD ADDRri:$addr, GPR:$val)>;
 
 // Atomic Exchange
 class XCHG<BPFWidthModifer SizeOp, string OpcodeStr, PatFrag OpNode>
diff --git a/llvm/lib/Target/BPF/BPFMIChecking.cpp b/llvm/lib/Target/BPF/BPFMIChecking.cpp
index 24224f6c1e9e66..6010539d21bad0 100644
--- a/llvm/lib/Target/BPF/BPFMIChecking.cpp
+++ b/llvm/lib/Target/BPF/BPFMIChecking.cpp
@@ -43,14 +43,14 @@ struct BPFMIPreEmitChecking : public MachineFunctionPass {
   // Initialize class variables.
   void initialize(MachineFunction &MFParm);
 
-  void processAtomicInsts();
+  bool processAtomicInsts();
 
 public:
   // Main entry point for this pass.
   bool runOnMachineFunction(MachineFunction &MF) override {
     if (!skipFunction(MF.getFunction())) {
       initialize(MF);
-      processAtomicInsts();
+      return processAtomicInsts();
     }
     return false;
   }
@@ -152,22 +152,91 @@ static bool hasLiveDefs(const MachineInstr &MI, const TargetRegisterInfo *TRI) {
   return false;
 }
 
-void BPFMIPreEmitChecking::processAtomicInsts() {
+bool BPFMIPreEmitChecking::processAtomicInsts() {
+  if (!MF->getSubtarget<BPFSubtarget>().getHasJmp32()) {
+    // Only check for cpu version 1 and 2.
+    for (MachineBasicBlock &MBB : *MF) {
+      for (MachineInstr &MI : MBB) {
+        if (MI.getOpcode() != BPF::XADDW && MI.getOpcode() != BPF::XADDD)
+          continue;
+
+        LLVM_DEBUG(MI.dump());
+        if (hasLiveDefs(MI, TRI)) {
+          DebugLoc Empty;
+          const DebugLoc &DL = MI.getDebugLoc();
+          const Function &F = MF->getFunction();
+          F.getContext().diagnose(DiagnosticInfoUnsupported{
+              F, "Invalid usage of the XADD return value", DL});
+        }
+      }
+    }
+  }
+
+  // Check return values of atomic_fetch_and_{add,and,or,xor}.
+  // If the return is not used, the atomic_fetch_and_<op> instruction
+  // is replaced with atomic_<op> instruction.
+  MachineInstr *ToErase = nullptr;
+  bool Changed = false;
+  const BPFInstrInfo *TII = MF->getSubtarget<BPFSubtarget>().getInstrInfo();
   for (MachineBasicBlock &MBB : *MF) {
     for (MachineInstr &MI : MBB) {
-      if (MI.getOpcode() != BPF::XADDW && MI.getOpcode() != BPF::XADDD)
+      if (ToErase) {
+        ToErase->eraseFromParent();
+        ToErase = nullptr;
+      }
+
+      if (MI.getOpcode() != BPF::XADDW32 && MI.getOpcode() != BPF::XADDD &&
+          MI.getOpcode() != BPF::XANDW32 && MI.getOpcode() != BPF::XANDD &&
+          MI.getOpcode() != BPF::XXORW32 && MI.getOpcode() != BPF::XXORD &&
+          MI.getOpcode() != BPF::XORW32 && MI.getOpcode() != BPF::XORD)
         continue;
 
-      LLVM_DEBUG(MI.dump());
-      if (hasLiveDefs(MI, TRI)) {
-        DebugLoc Empty;
-        const DebugLoc &DL = MI.getDebugLoc();
-        const Function &F = MF->getFunction();
-        F.getContext().diagnose(DiagnosticInfoUnsupported{
-            F, "Invalid usage of the XADD return value", DL});
+      if (!hasLiveDefs(MI, TRI))
+        continue;
+
+      LLVM_DEBUG(dbgs() << "Transforming "; MI.dump());
+      unsigned newOpcode;
+      switch (MI.getOpcode()) {
+      case BPF::XADDW32:
+        newOpcode = BPF::XFADDW32;
+        break;
+      case BPF::XADDD:
+        newOpcode = BPF::XFADDD;
+        break;
+      case BPF::XANDW32:
+        newOpcode = BPF::XFANDW32;
+        break;
+      case BPF::XANDD:
+        newOpcode = BPF::XFANDD;
+        break;
+      case BPF::XXORW32:
+        newOpcode = BPF::XFXORW32;
+        break;
+      case BPF::XXORD:
+        newOpcode = BPF::XFXORD;
+        break;
+      case BPF::XORW32:
+        newOpcode = BPF::XFORW32;
+        break;
+      case BPF::XORD:
+        newOpcode = BPF::XFORD;
+        break;
+      default:
+        llvm_unreachable("Incorrect Atomic Instruction Opcode");
       }
+
+      BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(newOpcode))
+          .add(MI.getOperand(0))
+          .add(MI.getOperand(1))
+          .add(MI.getOperand(2))
+          .add(MI.getOperand(3));
+
+      ToErase = &MI;
+      Changed = true;
     }
   }
+
+  return Changed;
 }
 
 } // namespace
diff --git a/llvm/lib/Target/BPF/BTFDebug.cpp b/llvm/lib/Target/BPF/BTFDebug.cpp
index 4d847abea731dc..1cd82720fa7e81 100644
--- a/llvm/lib/Target/BPF/BTFDebug.cpp
+++ b/llvm/lib/Target/BPF/BTFDebug.cpp
@@ -91,6 +91,12 @@ void BTFTypeDerived::completeType(BTFDebug &BDebug) {
 
   // The base type for PTR/CONST/VOLATILE could be void.
   const DIType *ResolvedType = DTy->getBaseType();
+  if (ResolvedType) {
+    const auto *DerivedTy = dyn_cast<DIDerivedType>(ResolvedType);
+    if (DerivedTy && DerivedTy->getTag() == dwarf::DW_TAG_atomic_type)
+      ResolvedType = DerivedTy->getBaseType();
+  }
+
   if (!ResolvedType) {
     assert((Kind == BTF::BTF_KIND_PTR || Kind == BTF::BTF_KIND_CONST ||
             Kind == BTF::BTF_KIND_VOLATILE) &&
@@ -800,6 +806,10 @@ void BTFDebug::visitDerivedType(const DIDerivedType *DTy, uint32_t &TypeId,
                                 bool CheckPointer, bool SeenPointer) {
   unsigned Tag = DTy->getTag();
 
+  if (Tag == dwarf::DW_TAG_atomic_type)
+    return visitTypeEntry(DTy->getBaseType(), TypeId, CheckPointer,
+                          SeenPointer);
+
   /// Try to avoid chasing pointees, esp. structure pointees which may
   /// unnecessary bring in a lot of types.
   if (CheckPointer && !SeenPointer) {
@@ -1444,8 +1454,15 @@ void BTFDebug::processGlobals(bool ProcessingMapDef) {
       DIGlobal = GVE->getVariable();
       if (SecName.starts_with(".maps"))
         visitMapDefType(DIGlobal->getType(), GVTypeId);
-      else
-        visitTypeEntry(DIGlobal->getType(), GVTypeId, false, false);
+      else {
+        const DIType *Ty = DIGlobal->getType();
+        if (Ty) {
+          auto *DTy = dyn_cast<DIDerivedType>(Ty);
+          if (DTy && DTy->getTag() == dwarf::DW_TAG_atomic_type)
+            Ty = DTy->getBaseType();
+        }
+        visitTypeEntry(Ty, GVTypeId, false, false);
+      }
       break;
     }
 

@yonghong-song
Contributor Author

Added a change in BTFDebug.cpp to ignore dwarf::DW_TAG_atomic_type. Also added some info about bpf prog atomic type usage and what the skeleton looks like (see the commit message of commit 2).
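To illustrate what ignoring the modifier means for consumers (a sketch, not output from this patch's tests): with the BTFDebug.cpp change, a global declared as _Atomic int is encoded in BTF as a plain int, so a libbpf skeleton would surface it roughly as:

/* Hypothetical skeleton excerpt (assumed object name "test3"): the _Atomic
 * qualifier is dropped in BTF, so the variable appears as a plain int. */
struct test3__bss {
   int i;
} *bss;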

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 9, 2024
llvm change [1] made __sync_fetch_and_{and,or,xor}() generate
atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason for the failure is [2], where atomic{64,}_fetch_{and,or,xor}() are not
allowed for arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested finding a way to generate locked insns from source, which is more user
friendly. So in the not-yet-merged llvm patch ([4]), if relaxed memory ordering is used and
the return value is not used, a locked insn can be generated.

So with llvm patch [4] used to compile the bpf selftests, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
generates a locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
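In selftest terms, the kernel-side fix amounts to switching the builtin used for the unused-result case; a sketch based on the lines quoted above (the local declaration of and64_value is an assumption, in the real selftest it lives in arena memory):

#include <stdatomic.h>

_Atomic unsigned long long and64_value;

void and_op(void)
{
   /* Old: __sync_fetch_and_and(&and64_value, 0x011ull << 32);
    * always seq_cst, now lowered to atomic64_fetch_and(), which the
    * arena JIT rejects. New: relaxed ordering with an unused result
    * lowers to a locked insn instead. */
   __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
}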
Copy link
Contributor

@eddyz87 eddyz87 left a comment


I agree with these changes, could you please also add tests?

llvm/lib/Target/BPF/BPFInstrInfo.td Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BPFInstrInfo.td Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BTFDebug.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BTFDebug.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BPFMIChecking.cpp Outdated Show resolved Hide resolved
@yonghong-song
Copy link
Contributor Author

For

I agree with these changes, could you please also add tests?

Sure. Will do.

kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 10, 2024
kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Sep 10, 2024
kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 10, 2024
kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Sep 11, 2024
kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 11, 2024
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
@yonghong-song
Copy link
Contributor Author

This pull request has been used for the merged kernel patch:
https://lore.kernel.org/r/[email protected]

@yonghong-song yonghong-song changed the title [RFC][BPF] Do atomic_fetch_*() pattern matching with memory ordering [BPF] Do atomic_fetch_*() pattern matching with memory ordering Sep 14, 2024
llvm/lib/Target/BPF/BPFMIChecking.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BPFMIChecking.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BTFDebug.cpp Show resolved Hide resolved
llvm/test/CodeGen/BPF/BTF/atomics.ll Outdated Show resolved Hide resolved
@eddyz87
Copy link
Contributor

eddyz87 commented Sep 16, 2024

@yonghong-song, note: if all three commits remain part of a single pull request, they are required to be squashed (link). If you want all three to be separate, each has to have its own separate pull request (one branch forked from another).

@yonghong-song
Copy link
Contributor Author

@yonghong-song, note: if all three commits remain part of a single pull request, they are required to be squashed (link). If you want all three to be separate, each has to have its own separate pull request (one branch forked from another).

I will squash them into a single commit.

@yonghong-song
Copy link
Contributor Author

Uploaded a new version:

  • better pattern matching (differentiating return vs. no-return) to avoid late optimization hack (from @eddyz87)
  • use python script print_btf.py to dump BTF which is easier to understand (from @eddyz87)

Copy link

github-actions bot commented Sep 24, 2024

✅ With the latest revision this PR passed the Python code formatter.

Yonghong Song added 3 commits September 24, 2024 12:27
For atomic fetch_and_*() operations, do pattern matching with memory ordering
seq_cst, acq_rel, release, acquire and monotonic (relaxed). For fetch_and_*()
operations with seq_cst/acq_rel/release/acquire ordering, atomic_fetch_*()
instructions are generated. For monotonic ordering, locked insns are generated
if return value is not used. Otherwise, atomic_fetch_*() insns are used.
The main motivation is to resolve the kernel issue [1].

The following memory orderings are supported:
  seq_cst, acq_rel, release, acquire, relaxed
Current gcc style __sync_fetch_and_*() operations are all seq_cst.

To use explicit memory ordering, the _Atomic type is needed.  The following is
an example:

```
$ cat test.c
#include <stdatomic.h>
void f1(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
void f2(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
void f3(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test.c -o test.o && llvm-objdump -d test.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       1:       c3 21 00 00 50 00 00 00 lock *(u32 *)(r1 + 0x0) &= w2
       2:       95 00 00 00 00 00 00 00 exit

0000000000000018 <f2>:
       3:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       4:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       5:       95 00 00 00 00 00 00 00 exit

0000000000000030 <f3>:
       6:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       7:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       8:       95 00 00 00 00 00 00 00 exit
```

The following is another example where return value is used:

```
$ cat test1.c
#include <stdatomic.h>
int f1(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
int f2(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
int f3(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test1.c -o test1.o && llvm-objdump -d test1.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       1:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       2:       95 00 00 00 00 00 00 00 exit

0000000000000018 <f2>:
       3:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       4:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       5:       95 00 00 00 00 00 00 00 exit

0000000000000030 <f3>:
       6:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       7:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       8:       95 00 00 00 00 00 00 00 exit
```

You can see that for relaxed memory ordering, if the return value is used, the atomic_fetch_and()
insn is used. Otherwise, if the return value is not used, the locked insn is used.

Here is another example with global _Atomic variable:

```
$ cat test3.c
#include <stdatomic.h>

_Atomic int i;

void f1(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_relaxed);
}
void f2(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test3.c -o test3.o && llvm-objdump -d test3.o
$ ./run.sh

test3.o:        file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       1:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       3:       c3 12 00 00 50 00 00 00 lock *(u32 *)(r2 + 0x0) &= w1
       4:       95 00 00 00 00 00 00 00 exit

0000000000000028 <f2>:
       5:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       6:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       8:       c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
       9:       95 00 00 00 00 00 00 00 exit
```

Note that in the above compilations, '-g' is not used. The reason is the following IR
related to the _Atomic type:
```
$clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -g -S -emit-llvm test3.c
```
The related debug info for test3.c:
```
!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(name: "i", scope: !2, file: !3, line: 3, type: !16, isLocal: false, isDefinition: true)
...
!16 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !17)
!17 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
```

If compiling test.c, the related debug info:
```
...
!19 = distinct !DISubprogram(name: "f1", scope: !1, file: !1, line: 3, type: !20, scopeLine: 3, flags: DIFlagPrototyped | DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !25)
!20 = !DISubroutineType(types: !21)
!21 = !{null, !22}
!22 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !23, size: 64)
!23 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !24)
!24 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
!25 = !{!26}
!26 = !DILocalVariable(name: "i", arg: 1, scope: !19, file: !1, line: 3, type: !22)
```

All the above suggests _Atomic behaves like a modifier (e.g. const, restrict, volatile).
This seems true based on doc [1].

Without properly handling DW_TAG_atomic_type, llvm BTF generation will be incorrect since
the current implementation assumes DW_TAG_atomic_type does not exist. So we have
two choices here:
  (1). llvm bpf backend processes DW_TAG_atomic_type but ignores it in BTF encoding.
  (2). Add another type, e.g., BTF_KIND_ATOMIC to BTF. BTF_KIND_ATOMIC behaves as a
       modifier like const/volatile/restrict.

For choice (1), llvm bpf backend should skip dwarf::DW_TAG_atomic_type during
BTF generation whenever necessary.

For choice (2), BTF_KIND_ATOMIC would be added to BTF, so the llvm backend and the kernel
need to handle it properly. Its main advantage is probably that the atomic type is
preserved and thus also available to the skeleton. But I think for the skeleton a raw
type might be good enough unless user space intends to do some atomic operation with it,
which is an unlikely case.

So I choose choice (1) in this implementation.

 [1] https://lore.kernel.org/bpf/[email protected]/
 [2] https://dwarfstd.org/issues/131112.1.html
Make a change in BTFDebug.cpp to handle DW_TAG_atomic_type properly.
Otherwise, for a type like
   _Atomic int i; // global
the dwarf type chain is atomic->int.
Since DW_TAG_atomic_type is not processed, BTF generation will stop
at the atomic modifier and BTF will encode 'i' as a void type.

Similarly, for a type like
  volatile _Atomic int *p;
the dwarf type chain is ptr->volatile->atomic->int.
Since the atomic type is not processed, BTF generation will stop at the
atomic type, and the eventual BTF type will be
  ptr->volatile->void
which is incorrect.

This patch fixes the following cases, including the above two patterns,
by skipping DW_TAG_atomic_type (illustrative declarations follow the list below):
  - global variable with _Atomic type.
  - function parameter and return type with _Atomic type.
  - struct member with _Atomic type.
  - ptr,const,volatile,restrict pointing to a _Atomic type.
  - btf_type_tag, where a ptr points to an _Atomic type carrying a btf_type_tag.
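
Illustrative C declarations (not the actual test file) covering the cases above; each _Atomic modifier is expected to be skipped in the generated BTF:

```c
/* Hypothetical declarations exercising the listed cases. */
_Atomic int g;                                  /* global variable with _Atomic type */
int f(_Atomic int *p);                          /* function parameter with _Atomic type */
struct s { _Atomic int m; };                    /* struct member with _Atomic type */
const volatile _Atomic int *q;                  /* const/volatile ptr to an _Atomic type */
int _Atomic __attribute__((btf_type_tag("foo"))) *root; /* _Atomic plus btf_type_tag */
```
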

With the changed llvm, in the kernel selftest arena_atomics.c ([1]), the new bpf
code looks like:

```
_Atomic __u64 __arena_global and64_value = (0x110ull << 32);
_Atomic __u32 __arena_global and32_value = 0x110;

SEC("raw_tp/sys_enter")
int and(const void *ctx)
{
	...
        __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
        __c11_atomic_fetch_and(&and32_value, 0x011, memory_order_relaxed);
	...

        return 0;
}
```
and compilation is successful.

The skel file arena_atomics.skel.h will be
```
struct arena_atomics__arena {
	...
	__u64 and64_value;
	__u32 and32_value;
	...
} *arena;
```

  [1] https://lore.kernel.org/r/[email protected]
The following test cases are added:
  - all memory orderings and their asm codes with -mcpu=v3
  - all memory orderings and their asm codes with -mcpu=v1
    Note that 32-bit __c11_atomic_fetch_{sub,and,or,xor} won't
    work with -mcpu=v1. Also, at -mcpu=v1, no return value is
    allowed for 64-bit __sync_fetch_and_add.
  - at -mcpu=v1, for 64-bit __c11_atomic_fetch_sub() with relaxed
    memory ordering, the xaddd insn is used, so a return value is
    not supported; otherwise it works fine when the return value
    is not used. This aligns with 64-bit __c11_atomic_fetch_add()
    with relaxed memory ordering at -mcpu=v1 (see the sketch after
    this list).
  - BTF tests with _Atomic types in different cases.
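
A hedged sketch (function and variable names are illustrative, not the actual test files) of the source shapes the -mcpu=v1 notes above describe:

```c
#include <stdatomic.h>

_Atomic long val64;
_Atomic int val32;

/* 64-bit relaxed add/sub with the result ignored: per the notes above,
   these can fall back to the legacy xaddd-style insn at -mcpu=v1 */
void add64_noret(void) { (void)__c11_atomic_fetch_add(&val64, 1, memory_order_relaxed); }
void sub64_noret(void) { (void)__c11_atomic_fetch_sub(&val64, 1, memory_order_relaxed); }

/* 32-bit fetch ops like this one are noted above as not working at -mcpu=v1 */
void and32_noret(void) { (void)__c11_atomic_fetch_and(&val32, 1, memory_order_relaxed); }
```
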
@yonghong-song
Copy link
Contributor Author

Uploaded new code to address a missing case like

int _Atomic __attribute__((btf_type_tag("foo"))) *root;

@yonghong-song yonghong-song merged commit 4c4fb6a into llvm:main Sep 24, 2024
10 checks passed
@jemarch
Copy link

jemarch commented Sep 25, 2024 via email

@uweigand
Copy link
Member

Looks like the new test case is failing on SystemZ: https://lab.llvm.org/buildbot/#/builders/42/builds/1192

struct.error: unpack_from requires a buffer of at least 402653196 bytes for unpacking 12 bytes at offset 402653184 (actual buffer size is 479)

At first glance, this might be an endian problem (SystemZ is big-endian; 402653196 is 0x1800000c).
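
As a rough illustration of that guess (not the actual print_btf.py code), misreading a little-endian 32-bit offset of 24 with big-endian byte order yields exactly the numbers in the error message:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
   /* 24 stored as little-endian bytes */
   const unsigned char le[4] = {0x18, 0x00, 0x00, 0x00};
   /* misread with big-endian byte order */
   uint32_t off = ((uint32_t)le[0] << 24) | ((uint32_t)le[1] << 16) |
                  ((uint32_t)le[2] << 8) | (uint32_t)le[3];
   /* prints 0x18000000 (402653184) and 0x1800000c (402653196) */
   printf("offset 0x%08x, offset + 12 = 0x%08x\n", off, off + 12);
   return 0;
}
```
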

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
@yonghong-song
Copy link
Contributor Author

yonghong-song commented Sep 27, 2024

Looks like the new test case is failing on SystemZ: https://lab.llvm.org/buildbot/#/builders/42/builds/1192

struct.error: unpack_from requires a buffer of at least 402653196 bytes for unpacking 12 bytes at offset 402653184 (actual buffer size is 479)

At first glance, this might be an endian problem (SystemZ is big-endian; 402653196 is 0x1800000c).

Thanks for reporting! The test run commands:

; RUN: llc -march=bpfel -mcpu=v3 -filetype=obj -o %t1 %s
; RUN: llvm-objcopy --dump-section='.BTF'=%t2 %t1
; RUN: %python %p/print_btf.py %t2 | FileCheck -check-prefixes=CHECK %s

It is possibly a print_btf.py issue. Let me investigate.

@eddyz87
Copy link
Contributor

eddyz87 commented Sep 27, 2024

@yonghong-song , @uweigand , the fix is available in #110332

xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
Labels
clang:codegen clang:frontend clang debuginfo