[BPF] Do atomic_fetch_*() pattern matching with memory ordering #107343

Merged
3 commits merged on Sep 24, 2024

Conversation

yonghong-song
Contributor

@yonghong-song yonghong-song commented Sep 5, 2024

Three commits in this pull request:
commit 1: implement pattern matching for the memory orderings seq_cst, acq_rel, release, acquire and monotonic. Specifically, for monotonic memory ordering (the relaxed memory model), if the return value is not used, a locked insn is used.
commit 2: add support for handling the dwarf atomic modifier in BTF generation. The atomic modifier is simply ignored in BTF.
commit 3: add tests for the new atomic ordering support and for BTF support with the _Atomic type.
I removed the RFC tag as the patch set is now in a reasonable state.

For atomic fetch_and_*() operations, do pattern matching with the memory orderings
seq_cst, acq_rel, release, acquire and monotonic (relaxed). For fetch_and_*()
operations with seq_cst/acq_rel/release/acquire ordering, atomic_fetch_*()
instructions are generated. For monotonic ordering, locked insns are generated
if the return value is not used; otherwise, atomic_fetch_*() insns are used.
The main motivation is to resolve the kernel issue [1].

The following memory orderings are supported:
seq_cst, acq_rel, release, acquire, relaxed
The current gcc-style __sync_fetch_and_*() operations are all seq_cst.
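For comparison, here is a minimal sketch (not part of this patch) showing that the gcc-style builtin carries no ordering argument while the __atomic form makes the same seq_cst ordering explicit:

/* Sketch only: both calls request sequentially consistent ordering. */
void bump(long *p) {
   (void)__sync_fetch_and_add(p, 1);
   (void)__atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
}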

To use explicit memory ordering, the _Atomic type is needed. The following is
an example:

$ cat test.c
#include <stdatomic.h>
void f1(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
void f2(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
void f3(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test.c -o test.o && llvm-objdump -d test.o
$ ./run.sh
       
test.o: file format elf64-bpf
       
Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       1:       c3 21 00 00 50 00 00 00 lock *(u32 *)(r1 + 0x0) &= w2
       2:       95 00 00 00 00 00 00 00 exit
       
0000000000000018 <f2>:
       3:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       4:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       5:       95 00 00 00 00 00 00 00 exit
       
0000000000000030 <f3>:
       6:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       7:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       8:       95 00 00 00 00 00 00 00 exit

The following is another example where the return value is used:

$ cat test1.c
#include <stdatomic.h>
int f1(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}  
int f2(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_acquire);
}  
int f3(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}  
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test1.c -o test1.o && llvm-objdump -d test1.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       1:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       2:       95 00 00 00 00 00 00 00 exit
       
0000000000000018 <f2>:
       3:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       4:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       5:       95 00 00 00 00 00 00 00 exit
       
0000000000000030 <f3>:
       6:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       7:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       8:       95 00 00 00 00 00 00 00 exit

You can see that for relaxed memory ordering, if the return value is used, an atomic_fetch_and()
insn is used; otherwise, if the return value is not used, a locked insn is used.
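For reference, the same pair of cases can also be written with the standard <stdatomic.h> generic functions rather than the __c11 builtins (a sketch, not taken from this patch's tests):

#include <stdatomic.h>

_Atomic int flags;

/* Result unused: with relaxed ordering this can lower to a locked insn. */
void clear_bit(void) {
   atomic_fetch_and_explicit(&flags, ~0x4, memory_order_relaxed);
}

/* Result used: an atomic_fetch_and() insn is needed regardless of ordering. */
int clear_bit_old(void) {
   return atomic_fetch_and_explicit(&flags, ~0x4, memory_order_relaxed);
}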

Here is another example with a global _Atomic variable:

$ cat test3.c
#include <stdatomic.h>

_Atomic int i;

void f1(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_relaxed);
}
void f2(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test3.c -o test3.o && llvm-objdump -d test3.o
$ ./run.sh

test3.o:        file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       1:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       3:       c3 12 00 00 50 00 00 00 lock *(u32 *)(r2 + 0x0) &= w1
       4:       95 00 00 00 00 00 00 00 exit
       
0000000000000028 <f2>:
       5:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       6:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       8:       c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
       9:       95 00 00 00 00 00 00 00 exit

Note that '-g' is not used in the above compilations. The reason is the following IR
related to the _Atomic type:

$ clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -g -S -emit-llvm test3.c

The related debug info for test3.c:

!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(name: "i", scope: !2, file: !3, line: 3, type: !16, isLocal: false, isDefinition: true)
...
!16 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !17)
!17 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)

When compiling test.c, the related debug info is:

...
!19 = distinct !DISubprogram(name: "f1", scope: !1, file: !1, line: 3, type: !20, scopeLine: 3, flags: DIFlagPrototyped | DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !25)
!20 = !DISubroutineType(types: !21)
!21 = !{null, !22}
!22 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !23, size: 64)
!23 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !24)
!24 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
!25 = !{!26}
!26 = !DILocalVariable(name: "i", arg: 1, scope: !19, file: !1, line: 3, type: !22)

All of the above suggests that _Atomic behaves like a modifier (e.g. const, restrict, volatile).
This seems true based on the DWARF doc [2].
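As a quick illustration (a sketch, not from the patch), _Atomic composes with a base type the same way the other qualifiers do:

/* Both declarations name an atomic int; _Atomic acts as a type qualifier. */
_Atomic int a;
int _Atomic b;

/* It also nests under derived types, e.g. a pointer to an atomic int. */
_Atomic int *p = &a;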

Without properly handling DW_TAG_atomic_type, llvm BTF generation will be incorrect since
the current implementation assumes DW_TAG_atomic_type does not exist. So we have
two choices here:
(1) The llvm bpf backend processes DW_TAG_atomic_type but ignores it in the BTF encoding.
(2) Add another type, e.g. BTF_KIND_ATOMIC, to BTF. BTF_KIND_ATOMIC would behave as a
modifier like const/volatile/restrict.

For choice (1), the llvm bpf backend should skip dwarf::DW_TAG_atomic_type during
BTF generation wherever necessary.

For choice (2), BTF_KIND_ATOMIC would be added to BTF, so both the llvm backend and the kernel
would need to handle it properly. Its main advantage is probably that the atomic type is
preserved and therefore also available to the skeleton. But I think for the skeleton a raw
type is good enough unless user space intends to do some atomic operation with it, which is
an unlikely case.

So I chose choice (1) in this implementation. See the commit message of the second commit for details.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://dwarfstd.org/issues/131112.1.html

@yonghong-song
Contributor Author

cc @anakryiko @jemarch

@eddyz87
Contributor

eddyz87 commented Sep 5, 2024

So, basically we want to generate lock *(u64 *)(rX + ...) += rY for __c11_atomic_fetch_and(..., memory_order_relaxed) and rX = atomic_fetch_add(...) for everything else.
memory_order_relaxed corresponds to monotonic LLVM IR ordering.
And based on the referenced mailing list discussion, this is done to allow the ARM jit to generate the LDADD instruction (w/o L, A, or AL suffixes) for this C interface (which it currently does for lock ... instructions).

I was unable to figure out from the ARM documentation whether LDADD is monotonic or unordered. However, the test below shows that at least we are on the same page as the LLVM ARM backend:

$ cat test2.c
void f1(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
void f2(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_CONSUME); }
void f3(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_ACQUIRE); }
void f4(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_RELEASE); }
void f5(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_ACQ_REL); }
void f6(_Atomic long *i) { __c11_atomic_fetch_add(i, 10, __ATOMIC_SEQ_CST); }
$ clang --target=aarch64 -march=armv8.1-a -O2 test2.c -c -o - | llvm-objdump -Sdr -
<stdin>:	file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <f1>:
       0: 52800148     	mov	w8, #0xa                // =10
       4: f8280008     	ldadd	x8, x8, [x0]
       8: d65f03c0     	ret

000000000000000c <f2>:
       c: 52800148     	mov	w8, #0xa                // =10
      10: f8a80008     	ldadda	x8, x8, [x0]
      14: d65f03c0     	ret
0000000000000018 <f3>:
      18: 52800148     	mov	w8, #0xa                // =10
      1c: f8a80008     	ldadda	x8, x8, [x0]
      20: d65f03c0     	ret
0000000000000024 <f4>:
      24: 52800148     	mov	w8, #0xa                // =10
      28: f8680008     	ldaddl	x8, x8, [x0]
      2c: d65f03c0     	ret
0000000000000030 <f5>:
      30: 52800148     	mov	w8, #0xa                // =10
      34: f8e80008     	ldaddal	x8, x8, [x0]
      38: d65f03c0     	ret
000000000000003c <f6>:
      3c: 52800148     	mov	w8, #0xa                // =10
      40: f8e80008     	ldaddal	x8, x8, [x0]
      44: d65f03c0     	ret

So, I think we are good.

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?
LLVM documentation allows this.
Edit: currently clang crashes with a backtrace if one of these is used; it is probably better to report an error in a more user-friendly way.

@eddyz87
Contributor

eddyz87 commented Sep 6, 2024

Also note:

$ cat test2.c
long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
$ clang --target=bpf -mcpu=v3 -O2 test2.c -c -o - | llvm-objdump -Sdr -
test2.c:1:6: error: Invalid usage of the XADD return value
    1 | long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
      |      ^
1 error generated.

<stdin>:	file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:	b7 00 00 00 0a 00 00 00	r0 = 0xa
       1:	db 01 00 00 00 00 00 00	lock *(u64 *)(r1 + 0x0) += r0
       2:	95 00 00 00 00 00 00 00	exit

Which seems incorrect.

@yonghong-song
Contributor Author

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?
LLVM documentation allows this.
Edit: currently clang crashes with a backtrace if one of these is used; it is probably better to report an error in a more user-friendly way.

Agree that issuing an error message is more user-friendly. Also, __ATOMIC_CONSUME is a variant of __ATOMIC_ACQUIRE, but it looks like they may have a subtle difference. From https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/_005f_005fatomic-Builtins.html:

__ATOMIC_CONSUME
    Data dependency only for both barrier and synchronization with another thread.
__ATOMIC_ACQUIRE
    Barrier to hoisting of code and synchronizes with release (or stronger) semantic stores from another thread. 

The Linux kernel does not support __ATOMIC_CONSUME, so the BPF backend may not support it anytime soon.
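For completeness, a small sketch (not from the patch) of how the two orderings are requested with the __atomic builtins; compilers currently strengthen consume to acquire when lowering:

/* Same builtin, different requested ordering. */
void take(long *p, long v) {
   (void)__atomic_fetch_add(p, v, __ATOMIC_CONSUME);
   (void)__atomic_fetch_add(p, v, __ATOMIC_ACQUIRE);
}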

@yonghong-song
Contributor Author

Also note:

$ cat test2.c
long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
$ clang --target=bpf -mcpu=v3 -O2 test2.c -c -o - | llvm-objdump -Sdr -
test2.c:1:6: error: Invalid usage of the XADD return value
    1 | long f1(_Atomic long *i) { return __c11_atomic_fetch_add(i, 10, __ATOMIC_RELAXED); }
      |      ^
1 error generated.

<stdin>:	file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:	b7 00 00 00 0a 00 00 00	r0 = 0xa
       1:	db 01 00 00 00 00 00 00	lock *(u64 *)(r1 + 0x0) += r0
       2:	95 00 00 00 00 00 00 00	exit

Which seems incorrect.

You are right. Yes, if the function's return value is used, we should just use atomic_fetch_*() insns instead of locked insns.

@anakryiko

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?

Why not use the strongest ordering for all those "intermediate" orderings instead of erroring out? Just use __ATOMIC_SEQ_CST? If, in the future, BPF has weaker orderings, then we can always downgrade to that (probably with cpu version bump).

@eddyz87
Contributor

eddyz87 commented Sep 6, 2024

Question: should the BPF backend report an error if __ATOMIC_{CONSUME,ACQUIRE,RELEASE,ACQ_REL} is used?

Why not use the strongest ordering for all those "intermediate" orderings instead of erroring out? Just use __ATOMIC_SEQ_CST? If, in the future, BPF has weaker orderings, then we can always downgrade to that (probably with cpu version bump).

I assume that there would be some reasoning behind users asking for a specific memory order (e.g. hoping for better performance), and thus it's better to let the user know which orderings are supported.

But falling back to __ATOMIC_SEQ_CST is an option, yes.

@anakryiko

anakryiko commented Sep 6, 2024

Not all orderings are meaningful on all architectures, but compilers don't error out on that, no? Think about writers of some common helpers dealing with atomic primitives. Why should they worry about which specific mode is supported (especially since that might change over time depending on kernel version)? They request the minimum level of ordering, but they would work equally well with stricter ordering if the underlying architecture (verifier/JIT in this case) doesn't support some of them on some older versions.

More broadly speaking (non-BPF specific), if I know I'm ok with relaxed ordering, but some architecture only supports SEQ_CST, I'm fine with that as a writer of atomic code. It's a much bigger PITA to work around all these version/arch-specific issues if they become compile-time or runtime errors, IMO.

@eddyz87
Contributor

eddyz87 commented Sep 6, 2024

More broadly speaking (non-BPF specific), if I know I'm ok with relaxed ordering, but some architecture only supports SEQ_CST, I'm fine with that as a writer of atomic code. It's a much bigger PITA to work around all these version/arch-specific issues if they become compile-time or runtime errors, IMO.

Fair enough.
@yonghong-song , what do you think?

@yonghong-song
Contributor Author

@anakryiko @eddyz87 Yes, we can initially implement the other memory orderings (acquire, release, acq_rel) to be the same as seq_cst. Later on, we can invent new instructions in llvm to actually support acquire, release and acq_rel properly. It is very likely that by the time we are able to land the llvm/kernel changes, we will already have figured out the new insn format and we will just implement them at that time.

@yonghong-song
Contributor Author

Just made the change to support seq_cst/acq_rel/acquire/release/monotonic memory ordering. Also fixed the issue where a locked insn was used even though the source expects a return value. I will continue to work on the llvm BPF backend to process DW_TAG_atomic_type as the next step.
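For BPF program authors, note that the clang part of this patch also defines a __BPF_FEATURE_ATOMIC_MEM_ORDERING macro (see the BPF.cpp hunk below), so sources can detect the new lowering; a sketch, with the fallback branch purely illustrative:

#include <stdatomic.h>

void clear_flag(_Atomic int *p) {
#ifdef __BPF_FEATURE_ATOMIC_MEM_ORDERING
   /* Ordering-aware compiler: relaxed fetch-and with an unused result
    * can lower to a locked insn. */
   (void)__c11_atomic_fetch_and(p, ~0x1, memory_order_relaxed);
#else
   /* Older compilers: fall back to the strongest ordering. */
   (void)__c11_atomic_fetch_and(p, ~0x1, memory_order_seq_cst);
#endif
}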

@llvmbot added the labels clang (Clang issues not falling into any other category) and clang:frontend (Language frontend issues, e.g. anything involving "Sema") on Sep 9, 2024
@llvmbot
Collaborator

llvmbot commented Sep 9, 2024

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-debuginfo

@llvm/pr-subscribers-clang

Author: None (yonghong-song)

Changes

Without properly handling DW_TAG_atomic_type, llvm BTF generation will be incorrect since
the current implementation assumes DW_TAG_atomic_type does not exist. So we have
two choices here:
(1) The llvm bpf backend processes DW_TAG_atomic_type but ignores it in the BTF encoding.
(2) Add another type, e.g. BTF_KIND_ATOMIC, to BTF. BTF_KIND_ATOMIC would behave as a
modifier like const/volatile/restrict.

For choice (1), the following is a hack which can make '-g' work for test1.c:

diff --git a/llvm/lib/Target/BPF/BTFDebug.cpp b/llvm/lib/Target/BPF/BTFDebug.cpp
index 4d847abea731..fd61bb811111 100644
--- a/llvm/lib/Target/BPF/BTFDebug.cpp
+++ b/llvm/lib/Target/BPF/BTFDebug.cpp
@@ -1444,8 +1444,14 @@ void BTFDebug::processGlobals(bool ProcessingMapDef) {
       DIGlobal = GVE->getVariable();
       if (SecName.starts_with(".maps"))
         visitMapDefType(DIGlobal->getType(), GVTypeId);
-      else
-        visitTypeEntry(DIGlobal->getType(), GVTypeId, false, false);
+      else {
+        const DIType *Ty = DIGlobal->getType();
+        auto *DTy = dyn_cast<DIDerivedType>(Ty);
+        if (DTy && DTy->getTag() == dwarf::DW_TAG_atomic_type)
+          visitTypeEntry(DTy->getBaseType(), GVTypeId, false, false);
+        else
+          visitTypeEntry(Ty, GVTypeId, false, false);
+      }
       break;
     } 

You can see that basically dwarf::DW_TAG_atomic_type is skipped during BTF generation.
Other changes are needed to handle the remaining usages of dwarf::DW_TAG_atomic_type.

For choice (2), BTF_KIND_ATOMIC would be added to BTF, so both the llvm backend and the kernel
would need to handle it properly. Its main advantage is probably that the atomic type is preserved
and therefore also available to the skeleton. But I think for the skeleton a raw type is good
enough unless user space intends to do some atomic operation with it. Do we really
have such cases?

[1] https://lore.kernel.org/bpf/7b941f53-2a05-48ec-9032-8f106face3a3@linux.dev/
[2] https://dwarfstd.org/issues/131112.1.html


Full diff: https://github.com/llvm/llvm-project/pull/107343.diff

4 Files Affected:

  • (modified) clang/lib/Basic/Targets/BPF.cpp (+1)
  • (modified) llvm/lib/Target/BPF/BPFInstrInfo.td (+103-10)
  • (modified) llvm/lib/Target/BPF/BPFMIChecking.cpp (+80-11)
  • (modified) llvm/lib/Target/BPF/BTFDebug.cpp (+19-2)
diff --git a/clang/lib/Basic/Targets/BPF.cpp b/clang/lib/Basic/Targets/BPF.cpp
index a94ceee5a6a5e7..77e3a9388b0c46 100644
--- a/clang/lib/Basic/Targets/BPF.cpp
+++ b/clang/lib/Basic/Targets/BPF.cpp
@@ -37,6 +37,7 @@ void BPFTargetInfo::getTargetDefines(const LangOptions &Opts,
   }
 
   Builder.defineMacro("__BPF_FEATURE_ADDR_SPACE_CAST");
+  Builder.defineMacro("__BPF_FEATURE_ATOMIC_MEM_ORDERING");
 
   if (CPU.empty())
     CPU = "v3";
diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
index f7e17901c7ed5e..68b0d1b70efe20 100644
--- a/llvm/lib/Target/BPF/BPFInstrInfo.td
+++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
@@ -864,26 +864,119 @@ class XFALU32<BPFWidthModifer SizeOp, BPFArithOp Opc, string OpcodeStr,
 
 let Constraints = "$dst = $val" in {
   let Predicates = [BPFHasALU32], DecoderNamespace = "BPFALU32" in {
-    def XFADDW32 : XFALU32<BPF_W, BPF_ADD, "u32", "add", atomic_load_add_i32>;
-    def XFANDW32 : XFALU32<BPF_W, BPF_AND, "u32", "and", atomic_load_and_i32>;
-    def XFORW32  : XFALU32<BPF_W, BPF_OR,  "u32", "or",  atomic_load_or_i32>;
-    def XFXORW32 : XFALU32<BPF_W, BPF_XOR, "u32", "xor", atomic_load_xor_i32>;
+    def XFADDW32 : XFALU32<BPF_W, BPF_ADD, "u32", "add", atomic_load_add_i32_seq_cst>;
+    def XFANDW32 : XFALU32<BPF_W, BPF_AND, "u32", "and", atomic_load_and_i32_seq_cst>;
+    def XFORW32  : XFALU32<BPF_W, BPF_OR,  "u32", "or",  atomic_load_or_i32_seq_cst>;
+    def XFXORW32 : XFALU32<BPF_W, BPF_XOR, "u32", "xor", atomic_load_xor_i32_seq_cst>;
   }
 
   let Predicates = [BPFHasALU32] in {
-    def XFADDD : XFALU64<BPF_DW, BPF_ADD, "u64", "add", atomic_load_add_i64>;
+    def XFADDD : XFALU64<BPF_DW, BPF_ADD, "u64", "add", atomic_load_add_i64_seq_cst>;
   }
-  def XFANDD : XFALU64<BPF_DW, BPF_AND, "u64", "and", atomic_load_and_i64>;
-  def XFORD  : XFALU64<BPF_DW, BPF_OR,  "u64", "or",  atomic_load_or_i64>;
-  def XFXORD : XFALU64<BPF_DW, BPF_XOR, "u64", "xor", atomic_load_xor_i64>;
+  def XFANDD : XFALU64<BPF_DW, BPF_AND, "u64", "and", atomic_load_and_i64_seq_cst>;
+  def XFORD  : XFALU64<BPF_DW, BPF_OR,  "u64", "or",  atomic_load_or_i64_seq_cst>;
+  def XFXORD : XFALU64<BPF_DW, BPF_XOR, "u64", "xor", atomic_load_xor_i64_seq_cst>;
+}
+
+let Predicates = [BPFHasALU32] in {
+    def : Pat<(atomic_load_add_i32_monotonic ADDRri:$addr, GPR32:$val),
+              (XADDW32 ADDRri:$addr, GPR32:$val)>;
+    def : Pat<(atomic_load_add_i32_acquire ADDRri:$addr, GPR32:$val),
+              (XFADDW32 ADDRri:$addr, GPR32:$val)>;
+    def : Pat<(atomic_load_add_i32_release ADDRri:$addr, GPR32:$val),
+              (XFADDW32 ADDRri:$addr, GPR32:$val)>;
+    def : Pat<(atomic_load_add_i32_acq_rel ADDRri:$addr, GPR32:$val),
+              (XFADDW32 ADDRri:$addr, GPR32:$val)>;
+
+    def : Pat<(atomic_load_add_i64_monotonic ADDRri:$addr, GPR:$val),
+              (XADDD ADDRri:$addr, GPR:$val)>;
+    def : Pat<(atomic_load_add_i64_acquire ADDRri:$addr, GPR:$val),
+              (XFADDD ADDRri:$addr, GPR:$val)>;
+    def : Pat<(atomic_load_add_i64_release ADDRri:$addr, GPR:$val),
+              (XFADDD ADDRri:$addr, GPR:$val)>;
+    def : Pat<(atomic_load_add_i64_acq_rel ADDRri:$addr, GPR:$val),
+              (XFADDD ADDRri:$addr, GPR:$val)>;
 }
 
 // atomic_load_sub can be represented as a neg followed
 // by an atomic_load_add.
-def : Pat<(atomic_load_sub_i32 ADDRri:$addr, GPR32:$val),
+// FIXME: the below can probably be simplified.
+def : Pat<(atomic_load_sub_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_release ADDRri:$addr, GPR32:$val),
+          (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
+def : Pat<(atomic_load_sub_i32_seq_cst ADDRri:$addr, GPR32:$val),
           (XFADDW32 ADDRri:$addr, (NEG_32 GPR32:$val))>;
-def : Pat<(atomic_load_sub_i64 ADDRri:$addr, GPR:$val),
+
+def : Pat<(atomic_load_sub_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_release ADDRri:$addr, GPR:$val),
+          (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_acq_rel ADDRri:$addr, GPR:$val),
           (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+def : Pat<(atomic_load_sub_i64_seq_cst ADDRri:$addr, GPR:$val),
+          (XFADDD ADDRri:$addr, (NEG_64 GPR:$val))>;
+
+def : Pat<(atomic_load_and_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XANDW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_and_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFANDW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_and_i32_release ADDRri:$addr, GPR32:$val),
+          (XFANDW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_and_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFANDW32 ADDRri:$addr, GPR32:$val)>;
+
+
+def : Pat<(atomic_load_and_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XANDD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_and_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFANDD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_and_i64_release ADDRri:$addr, GPR:$val),
+          (XFANDD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_and_i64_acq_rel ADDRri:$addr, GPR:$val),
+          (XFANDD ADDRri:$addr, GPR:$val)>;
+
+def : Pat<(atomic_load_or_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_or_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_or_i32_release ADDRri:$addr, GPR32:$val),
+          (XFORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_or_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFORW32 ADDRri:$addr, GPR32:$val)>;
+
+def : Pat<(atomic_load_or_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_or_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_or_i64_release ADDRri:$addr, GPR:$val),
+          (XFORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_or_i64_acq_rel ADDRri:$addr, GPR:$val),
+          (XFORD ADDRri:$addr, GPR:$val)>;
+
+def : Pat<(atomic_load_xor_i32_monotonic ADDRri:$addr, GPR32:$val),
+          (XXORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_xor_i32_acquire ADDRri:$addr, GPR32:$val),
+          (XFXORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_xor_i32_release ADDRri:$addr, GPR32:$val),
+          (XFXORW32 ADDRri:$addr, GPR32:$val)>;
+def : Pat<(atomic_load_xor_i32_acq_rel ADDRri:$addr, GPR32:$val),
+          (XFXORW32 ADDRri:$addr, GPR32:$val)>;
+
+def : Pat<(atomic_load_xor_i64_monotonic ADDRri:$addr, GPR:$val),
+          (XXORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_xor_i64_acquire ADDRri:$addr, GPR:$val),
+          (XFXORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_xor_i64_release ADDRri:$addr, GPR:$val),
+          (XFXORD ADDRri:$addr, GPR:$val)>;
+def : Pat<(atomic_load_xor_i64_acq_rel ADDRri:$addr, GPR:$val),
+          (XFXORD ADDRri:$addr, GPR:$val)>;
 
 // Atomic Exchange
 class XCHG<BPFWidthModifer SizeOp, string OpcodeStr, PatFrag OpNode>
diff --git a/llvm/lib/Target/BPF/BPFMIChecking.cpp b/llvm/lib/Target/BPF/BPFMIChecking.cpp
index 24224f6c1e9e66..6010539d21bad0 100644
--- a/llvm/lib/Target/BPF/BPFMIChecking.cpp
+++ b/llvm/lib/Target/BPF/BPFMIChecking.cpp
@@ -43,14 +43,14 @@ struct BPFMIPreEmitChecking : public MachineFunctionPass {
   // Initialize class variables.
   void initialize(MachineFunction &MFParm);
 
-  void processAtomicInsts();
+  bool processAtomicInsts();
 
 public:
   // Main entry point for this pass.
   bool runOnMachineFunction(MachineFunction &MF) override {
     if (!skipFunction(MF.getFunction())) {
       initialize(MF);
-      processAtomicInsts();
+      return processAtomicInsts();
     }
     return false;
   }
@@ -152,22 +152,91 @@ static bool hasLiveDefs(const MachineInstr &MI, const TargetRegisterInfo *TRI) {
   return false;
 }
 
-void BPFMIPreEmitChecking::processAtomicInsts() {
+bool BPFMIPreEmitChecking::processAtomicInsts() {
+  if (!MF->getSubtarget<BPFSubtarget>().getHasJmp32()) {
+    // Only check for cpu version 1 and 2.
+    for (MachineBasicBlock &MBB : *MF) {
+      for (MachineInstr &MI : MBB) {
+        if (MI.getOpcode() != BPF::XADDW && MI.getOpcode() != BPF::XADDD)
+          continue;
+
+        LLVM_DEBUG(MI.dump());
+        if (hasLiveDefs(MI, TRI)) {
+          DebugLoc Empty;
+          const DebugLoc &DL = MI.getDebugLoc();
+          const Function &F = MF->getFunction();
+          F.getContext().diagnose(DiagnosticInfoUnsupported{
+              F, "Invalid usage of the XADD return value", DL});
+        }
+      }
+    }
+  }
+
+  // Check return values of atomic_fetch_and_{add,and,or,xor}.
+  // If the return is not used, the atomic_fetch_and_<op> instruction
+  // is replaced with atomic_<op> instruction.
+  MachineInstr *ToErase = nullptr;
+  bool Changed = false;
+  const BPFInstrInfo *TII = MF->getSubtarget<BPFSubtarget>().getInstrInfo();
   for (MachineBasicBlock &MBB : *MF) {
     for (MachineInstr &MI : MBB) {
-      if (MI.getOpcode() != BPF::XADDW && MI.getOpcode() != BPF::XADDD)
+      if (ToErase) {
+        ToErase->eraseFromParent();
+        ToErase = nullptr;
+      }
+
+      if (MI.getOpcode() != BPF::XADDW32 && MI.getOpcode() != BPF::XADDD &&
+          MI.getOpcode() != BPF::XANDW32 && MI.getOpcode() != BPF::XANDD &&
+          MI.getOpcode() != BPF::XXORW32 && MI.getOpcode() != BPF::XXORD &&
+          MI.getOpcode() != BPF::XORW32 && MI.getOpcode() != BPF::XORD)
         continue;
 
-      LLVM_DEBUG(MI.dump());
-      if (hasLiveDefs(MI, TRI)) {
-        DebugLoc Empty;
-        const DebugLoc &DL = MI.getDebugLoc();
-        const Function &F = MF->getFunction();
-        F.getContext().diagnose(DiagnosticInfoUnsupported{
-            F, "Invalid usage of the XADD return value", DL});
+      if (!hasLiveDefs(MI, TRI))
+        continue;
+
+      LLVM_DEBUG(dbgs() << "Transforming "; MI.dump());
+      unsigned newOpcode;
+      switch (MI.getOpcode()) {
+      case BPF::XADDW32:
+        newOpcode = BPF::XFADDW32;
+        break;
+      case BPF::XADDD:
+        newOpcode = BPF::XFADDD;
+        break;
+      case BPF::XANDW32:
+        newOpcode = BPF::XFANDW32;
+        break;
+      case BPF::XANDD:
+        newOpcode = BPF::XFANDD;
+        break;
+      case BPF::XXORW32:
+        newOpcode = BPF::XFXORW32;
+        break;
+      case BPF::XXORD:
+        newOpcode = BPF::XFXORD;
+        break;
+      case BPF::XORW32:
+        newOpcode = BPF::XFORW32;
+        break;
+      case BPF::XORD:
+        newOpcode = BPF::XFORD;
+        break;
+      default:
+        llvm_unreachable("Incorrect Atomic Instruction Opcode");
       }
+
+      BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(newOpcode))
+          .add(MI.getOperand(0))
+          .add(MI.getOperand(1))
+          .add(MI.getOperand(2))
+          .add(MI.getOperand(3));
+
+      ToErase = &MI;
+      Changed = true;
     }
   }
+
+  return Changed;
 }
 
 } // namespace
diff --git a/llvm/lib/Target/BPF/BTFDebug.cpp b/llvm/lib/Target/BPF/BTFDebug.cpp
index 4d847abea731dc..1cd82720fa7e81 100644
--- a/llvm/lib/Target/BPF/BTFDebug.cpp
+++ b/llvm/lib/Target/BPF/BTFDebug.cpp
@@ -91,6 +91,12 @@ void BTFTypeDerived::completeType(BTFDebug &BDebug) {
 
   // The base type for PTR/CONST/VOLATILE could be void.
   const DIType *ResolvedType = DTy->getBaseType();
+  if (ResolvedType) {
+    const auto *DerivedTy = dyn_cast<DIDerivedType>(ResolvedType);
+    if (DerivedTy && DerivedTy->getTag() == dwarf::DW_TAG_atomic_type)
+      ResolvedType = DerivedTy->getBaseType();
+  }
+
   if (!ResolvedType) {
     assert((Kind == BTF::BTF_KIND_PTR || Kind == BTF::BTF_KIND_CONST ||
             Kind == BTF::BTF_KIND_VOLATILE) &&
@@ -800,6 +806,10 @@ void BTFDebug::visitDerivedType(const DIDerivedType *DTy, uint32_t &TypeId,
                                 bool CheckPointer, bool SeenPointer) {
   unsigned Tag = DTy->getTag();
 
+  if (Tag == dwarf::DW_TAG_atomic_type)
+    return visitTypeEntry(DTy->getBaseType(), TypeId, CheckPointer,
+                          SeenPointer);
+
   /// Try to avoid chasing pointees, esp. structure pointees which may
   /// unnecessary bring in a lot of types.
   if (CheckPointer && !SeenPointer) {
@@ -1444,8 +1454,15 @@ void BTFDebug::processGlobals(bool ProcessingMapDef) {
       DIGlobal = GVE->getVariable();
       if (SecName.starts_with(".maps"))
         visitMapDefType(DIGlobal->getType(), GVTypeId);
-      else
-        visitTypeEntry(DIGlobal->getType(), GVTypeId, false, false);
+      else {
+        const DIType *Ty = DIGlobal->getType();
+        if (Ty) {
+          auto *DTy = dyn_cast<DIDerivedType>(Ty);
+          if (DTy && DTy->getTag() == dwarf::DW_TAG_atomic_type)
+            Ty = DTy->getBaseType();
+        }
+        visitTypeEntry(Ty, GVTypeId, false, false);
+      }
       break;
     }
 

@yonghong-song
Contributor Author

Added a change in BTFDebug.cpp to ignore dwarf::DW_TAG_atomic_type. Also added some info about bpf prog atomic type usage and what the skeleton looks like (see the commit message of commit 2).
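To illustrate what ignoring the modifier means for consumers (a sketch, not output from this patch's tests): with the BTFDebug.cpp change, a global declared as _Atomic int is encoded in BTF as a plain int, so a libbpf skeleton would surface it roughly as:

/* Hypothetical skeleton excerpt (assumed object name "test3"): the _Atomic
 * qualifier is dropped in BTF, so the variable appears as a plain int. */
struct test3__bss {
   int i;
} *bss;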

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 9, 2024
llvm change [1] made __sync_fetch_and_{and,or,xor}() generate
atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason for the failure is [2], where atomic{64,}_fetch_{and,or,xor}() are not
allowed for arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested finding a way to generate locked insns from source, which is more user
friendly. So in the not-yet-merged llvm patch ([4]), if relaxed memory ordering is used and
the return value is not used, a locked insn can be generated.

So with llvm patch [4] used to compile the bpf selftests, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
generates a locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
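In selftest terms, the kernel-side fix amounts to switching the builtin used for the unused-result case; a sketch based on the lines quoted above (the local declaration of and64_value is an assumption, in the real selftest it lives in arena memory):

#include <stdatomic.h>

_Atomic unsigned long long and64_value;

void and_op(void)
{
   /* Old: __sync_fetch_and_and(&and64_value, 0x011ull << 32);
    * always seq_cst, now lowered to atomic64_fetch_and(), which the
    * arena JIT rejects. New: relaxed ordering with an unused result
    * lowers to a locked insn instead. */
   __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
}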
Copy link
Contributor

@eddyz87 eddyz87 left a comment


I agree with these changes, could you please also add tests?

llvm/lib/Target/BPF/BPFInstrInfo.td Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BPFInstrInfo.td Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BTFDebug.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BTFDebug.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BPFMIChecking.cpp Outdated Show resolved Hide resolved
@yonghong-song
Copy link
Contributor Author

For

I agree with these changes, could you please also add tests?

Sure. Will do.

kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 10, 2024
kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Sep 10, 2024
kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 10, 2024
kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Sep 11, 2024
kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 11, 2024
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
@yonghong-song
Copy link
Contributor Author

This pull request has been used for the merged kernel patch:
https://lore.kernel.org/r/[email protected]

@yonghong-song yonghong-song changed the title [RFC][BPF] Do atomic_fetch_*() pattern matching with memory ordering [BPF] Do atomic_fetch_*() pattern matching with memory ordering Sep 14, 2024
llvm/lib/Target/BPF/BPFMIChecking.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BPFMIChecking.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/BPF/BTFDebug.cpp Show resolved Hide resolved
llvm/test/CodeGen/BPF/BTF/atomics.ll Outdated Show resolved Hide resolved
@eddyz87
Copy link
Contributor

eddyz87 commented Sep 16, 2024

@yonghong-song, note: if all three commits remain part of a single pull request, they are required to be squashed (link). If you want all three to be separate, each has to have its own separate pull request (one branch forked from another).

@yonghong-song
Copy link
Contributor Author

@yonghong-song, note: if all three commits remain part of a single pull request, they are required to be squashed (link). If you want all three to be separate, each has to have its own separate pull request (one branch forked from another).

I will squash them into a single commit.

@yonghong-song
Copy link
Contributor Author

Uploaded a new version:

  • better pattern matching (differentiating return vs. no-return) to avoid late optimization hack (from @eddyz87)
  • use python script print_btf.py to dump BTF which is easier to understand (from @eddyz87)

Copy link

github-actions bot commented Sep 24, 2024

✅ With the latest revision this PR passed the Python code formatter.

Yonghong Song added 3 commits September 24, 2024 12:27
For atomic fetch_and_*() operations, do pattern matching with memory ordering
seq_cst, acq_rel, release, acquire and monotonic (relaxed). For fetch_and_*()
operations with seq_cst/acq_rel/release/acquire ordering, atomic_fetch_*()
instructions are generated. For monotonic ordering, locked insns are generated
if return value is not used. Otherwise, atomic_fetch_*() insns are used.
The main motivation is to resolve the kernel issue [1].

The following memory orderings are supported:
  seq_cst, acq_rel, release, acquire, relaxed
Current gcc style __sync_fetch_and_*() operations are all seq_cst.

To use explicit memory ordering, the _Atomic type is needed.  The following is
an example:

```
$ cat test.c
#include <stdatomic.h>
void f1(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
void f2(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
void f3(_Atomic int *i) {
   (void)__c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test.c -o test.o && llvm-objdump -d test.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       1:       c3 21 00 00 50 00 00 00 lock *(u32 *)(r1 + 0x0) &= w2
       2:       95 00 00 00 00 00 00 00 exit

0000000000000018 <f2>:
       3:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       4:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       5:       95 00 00 00 00 00 00 00 exit

0000000000000030 <f3>:
       6:       b4 02 00 00 0a 00 00 00 w2 = 0xa
       7:       c3 21 00 00 51 00 00 00 w2 = atomic_fetch_and((u32 *)(r1 + 0x0), w2)
       8:       95 00 00 00 00 00 00 00 exit
```

The following is another example where return value is used:

```
$ cat test1.c
#include <stdatomic.h>
int f1(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_relaxed);
}
int f2(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_acquire);
}
int f3(_Atomic int *i) {
   return __c11_atomic_fetch_and(i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test1.c -o test1.o && llvm-objdump -d test1.o
$ ./run.sh

test.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       1:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       2:       95 00 00 00 00 00 00 00 exit

0000000000000018 <f2>:
       3:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       4:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       5:       95 00 00 00 00 00 00 00 exit

0000000000000030 <f3>:
       6:       b4 00 00 00 0a 00 00 00 w0 = 0xa
       7:       c3 01 00 00 51 00 00 00 w0 = atomic_fetch_and((u32 *)(r1 + 0x0), w0)
       8:       95 00 00 00 00 00 00 00 exit
```

You can see that for relaxed memory ordering, if the return value is used, the atomic_fetch_and()
insn is used. Otherwise, if the return value is not used, the locked insn is used.

Here is another example with global _Atomic variable:

```
$ cat test3.c
#include <stdatomic.h>

_Atomic int i;

void f1(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_relaxed);
}
void f2(void) {
   (void)__c11_atomic_fetch_and(&i, 10, memory_order_seq_cst);
}
$ cat run.sh
clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -c test3.c -o test3.o && llvm-objdump -d test3.o
$ ./run.sh

test3.o:        file format elf64-bpf

Disassembly of section .text:

0000000000000000 <f1>:
       0:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       1:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       3:       c3 12 00 00 50 00 00 00 lock *(u32 *)(r2 + 0x0) &= w1
       4:       95 00 00 00 00 00 00 00 exit

0000000000000028 <f2>:
       5:       b4 01 00 00 0a 00 00 00 w1 = 0xa
       6:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       8:       c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
       9:       95 00 00 00 00 00 00 00 exit
```

Note that in the above compilations, '-g' is not used. The reason is the following IR
related to the _Atomic type:
```
$clang  -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -O2 --target=bpf -g -S -emit-llvm test3.c
```
The related debug info for test3.c:
```
!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(name: "i", scope: !2, file: !3, line: 3, type: !16, isLocal: false, isDefinition: true)
...
!16 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !17)
!17 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
```

If compiling test.c, the related debug info:
```
...
!19 = distinct !DISubprogram(name: "f1", scope: !1, file: !1, line: 3, type: !20, scopeLine: 3, flags: DIFlagPrototyped | DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !25)
!20 = !DISubroutineType(types: !21)
!21 = !{null, !22}
!22 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !23, size: 64)
!23 = !DIDerivedType(tag: DW_TAG_atomic_type, baseType: !24)
!24 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
!25 = !{!26}
!26 = !DILocalVariable(name: "i", arg: 1, scope: !19, file: !1, line: 3, type: !22)
```

All the above suggests _Atomic behaves like a modifier (e.g. const, restrict, volatile).
This seems true based on doc [1].

Without properly handling DW_TAG_atomic_type, llvm BTF generation will be incorrect since
the current implementation assumes DW_TAG_atomic_type does not exist. So we have
two choices here:
  (1). llvm bpf backend processes DW_TAG_atomic_type but ignores it in BTF encoding.
  (2). Add another type, e.g., BTF_KIND_ATOMIC to BTF. BTF_KIND_ATOMIC behaves as a
       modifier like const/volatile/restrict.

For choice (1), llvm bpf backend should skip dwarf::DW_TAG_atomic_type during
BTF generation whenever necessary.

For choice (2), BTF_KIND_ATOMIC would be added to BTF, so the llvm backend and the kernel
need to handle it properly. Its main advantage is probably that the atomic type is
preserved and thus also available to the skeleton. But I think for the skeleton a raw
type might be good enough unless user space intends to do some atomic operation with it,
which is an unlikely case.

So I choose choice (1) in this implementation.

 [1] https://lore.kernel.org/bpf/[email protected]/
 [2] https://dwarfstd.org/issues/131112.1.html
Make a change in BTFDebug.cpp to handle DW_TAG_atomic_type properly.
Otherwise, for a type like
   _Atomic int i; // global
the dwarf type chain is atomic->int.
Since DW_TAG_atomic_type is not processed, BTF generation will stop
at the atomic modifier and BTF will encode 'i' as a void type.

Similarly, for a type like
  volatile _Atomic int *p;
the dwarf type chain is ptr->volatile->atomic->int.
Since the atomic type is not processed, BTF generation will stop at the
atomic type, and the eventual BTF type will be
  ptr->volatile->void
which is incorrect.

This patch fixes the following cases, including the above two patterns,
by skipping DW_TAG_atomic_type (illustrative declarations follow the list below):
  - global variable with _Atomic type.
  - function parameter and return type with _Atomic type.
  - struct member with _Atomic type.
  - ptr,const,volatile,restrict pointing to a _Atomic type.
  - btf_type_tag, where a ptr points to an _Atomic type carrying a btf_type_tag.
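
Illustrative C declarations (not the actual test file) covering the cases above; each _Atomic modifier is expected to be skipped in the generated BTF:

```c
/* Hypothetical declarations exercising the listed cases. */
_Atomic int g;                                  /* global variable with _Atomic type */
int f(_Atomic int *p);                          /* function parameter with _Atomic type */
struct s { _Atomic int m; };                    /* struct member with _Atomic type */
const volatile _Atomic int *q;                  /* const/volatile ptr to an _Atomic type */
int _Atomic __attribute__((btf_type_tag("foo"))) *root; /* _Atomic plus btf_type_tag */
```
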

With the changed llvm, in the kernel selftest arena_atomics.c ([1]), the new bpf
code looks like:

```
_Atomic __u64 __arena_global and64_value = (0x110ull << 32);
_Atomic __u32 __arena_global and32_value = 0x110;

SEC("raw_tp/sys_enter")
int and(const void *ctx)
{
	...
        __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
        __c11_atomic_fetch_and(&and32_value, 0x011, memory_order_relaxed);
	...

        return 0;
}
```
and compilation is successful.

The skel file arena_atomics.skel.h will be
```
struct arena_atomics__arena {
	...
	__u64 and64_value;
	__u32 and32_value;
	...
} *arena;
```

  [1] https://lore.kernel.org/r/[email protected]
The following test cases are added:
  - all memory orderings and their asm codes with -mcpu=v3
  - all memory orderings and their asm codes with -mcpu=v1
    Note that 32-bit __c11_atomic_fetch_{sub,and,or,xor} won't
    work with -mcpu=v1. Also, at -mcpu=v1, no return value is
    allowed for 64-bit __sync_fetch_and_add.
  - at -mcpu=v1, for 64-bit __c11_atomic_fetch_sub() with relaxed
    memory ordering, the xaddd insn is used, so a return value is
    not supported; otherwise it works fine when the return value
    is not used. This aligns with 64-bit __c11_atomic_fetch_add()
    with relaxed memory ordering at -mcpu=v1 (see the sketch after
    this list).
  - BTF tests with _Atomic types in different cases.
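
A hedged sketch (function and variable names are illustrative, not the actual test files) of the source shapes the -mcpu=v1 notes above describe:

```c
#include <stdatomic.h>

_Atomic long val64;
_Atomic int val32;

/* 64-bit relaxed add/sub with the result ignored: per the notes above,
   these can fall back to the legacy xaddd-style insn at -mcpu=v1 */
void add64_noret(void) { (void)__c11_atomic_fetch_add(&val64, 1, memory_order_relaxed); }
void sub64_noret(void) { (void)__c11_atomic_fetch_sub(&val64, 1, memory_order_relaxed); }

/* 32-bit fetch ops like this one are noted above as not working at -mcpu=v1 */
void and32_noret(void) { (void)__c11_atomic_fetch_and(&val32, 1, memory_order_relaxed); }
```
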
@yonghong-song
Copy link
Contributor Author

Uploaded new code to address a missing case like

int _Atomic __attribute__((btf_type_tag("foo"))) *root;

@yonghong-song yonghong-song merged commit 4c4fb6a into llvm:main Sep 24, 2024
10 checks passed
@jemarch
Copy link

jemarch commented Sep 25, 2024 via email

@uweigand
Copy link
Member

Looks like the new test case is failing on SystemZ: https://lab.llvm.org/buildbot/#/builders/42/builds/1192

struct.error: unpack_from requires a buffer of at least 402653196 bytes for unpacking 12 bytes at offset 402653184 (actual buffer size is 479)

At first glance, this might be an endian problem (SystemZ is big-endian; 402653196 is 0x1800000c).
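
As a rough illustration of that guess (not the actual print_btf.py code), misreading a little-endian 32-bit offset of 24 with big-endian byte order yields exactly the numbers in the error message:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
   /* 24 stored as little-endian bytes */
   const unsigned char le[4] = {0x18, 0x00, 0x00, 0x00};
   /* misread with big-endian byte order */
   uint32_t off = ((uint32_t)le[0] << 24) | ((uint32_t)le[1] << 16) |
                  ((uint32_t)le[2] << 8) | (uint32_t)le[3];
   /* prints 0x18000000 (402653184) and 0x1800000c (402653196) */
   printf("offset 0x%08x, offset + 12 = 0x%08x\n", off, off + 12);
   return 0;
}
```
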

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
@yonghong-song
Copy link
Contributor Author

yonghong-song commented Sep 27, 2024

Looks like the new test case is failing on SystemZ: https://lab.llvm.org/buildbot/#/builders/42/builds/1192

struct.error: unpack_from requires a buffer of at least 402653196 bytes for unpacking 12 bytes at offset 402653184 (actual buffer size is 479)

At first glance, this might be an endian problem (SystemZ is big-endian; 402653196 is 0x1800000c).

Thanks for reporting! The test run commands:

; RUN: llc -march=bpfel -mcpu=v3 -filetype=obj -o %t1 %s
; RUN: llvm-objcopy --dump-section='.BTF'=%t2 %t1
; RUN: %python %p/print_btf.py %t2 | FileCheck -check-prefixes=CHECK %s

It is possibly a print_btf.py issue. Let me investigate.

@eddyz87
Copy link
Contributor

eddyz87 commented Sep 27, 2024

@yonghong-song , @uweigand , the fix is available in #110332

xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
Labels
clang:codegen clang:frontend clang debuginfo