
[BPF] Add load-acquire and store-release instructions under -mcpu=v4 #108636

Open
wants to merge 3 commits into base: main

Conversation

peilin-ye
Contributor

@peilin-ye peilin-ye commented Sep 13, 2024

Per policy, this PR will be split into multiple single-commit PRs before being merged. Please check each commit message for details.


As discussed in [1], introduce BPF instructions with load-acquire and
store-release semantics under -mcpu=v4.

[1] https://lore.kernel.org/all/[email protected]/

@peilin-ye peilin-ye marked this pull request as ready for review September 16, 2024 20:09
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" llvm:binary-utilities labels Sep 16, 2024
@llvmbot
Collaborator

llvmbot commented Sep 16, 2024

@llvm/pr-subscribers-llvm-binary-utilities

@llvm/pr-subscribers-clang

Author: Peilin Ye (peilin-ye)

Changes

As discussed in [1], introduce BPF instructions with load-acquire and
store-release semantics under -mcpu=v5.

A "load_acquire" is a BPF_LDX instruction with a new mode modifier,
BPF_MEMACQ ("acquiring atomic load"). Similarly, a "store_release" is a
BPF_STX instruction with another new mode modifier, BPF_MEMREL
("releasing atomic store").

BPF_MEMACQ and BPF_MEMREL share the same numeric value, 0x7 (or 0b111).
For example:

long foo(long *ptr) {
return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
}

foo() can be compiled to:

f9 10 00 00 00 00 00 00 r0 = load_acquire((u64 *)(r1 + 0x0))
95 00 00 00 00 00 00 00 exit

Opcode 0xf9, or 0b11111001, can be decoded as:

0b 111 11 001
BPF_MEMACQ BPF_DW BPF_LDX

Similarly:

void bar(short *ptr, short val) {
__atomic_store_n(ptr, val, __ATOMIC_RELEASE);
}

bar() can be compiled to:

eb 21 00 00 00 00 00 00 store_release((u16 *)(r1 + 0x0), w2)
95 00 00 00 00 00 00 00 exit

Opcode 0xeb, or 0b11101011, can be decoded as:

0b 111 01 011
BPF_MEMREL BPF_H BPF_STX

Inline assembly is also supported. For example:

asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))" :
"=r"(ret) : "r"(ptr) : "memory");

Let 'llvm-objdump -d' use -mcpu=v5 by default, just like commit
0395868 ("[BPF] Make llvm-objdump disasm default cpu v4
(#102166)").

Add two macros, __BPF_FEATURE_LOAD_ACQUIRE and
__BPF_FEATURE_STORE_RELEASE, to let developers detect these new features
in source code. They can also be disabled using two new llc options,
-disable-load-acquire and -disable-store-release, respectively.

Refactored BPFSubtarget::initSubtargetFeatures(), as well as LOAD,
STORE, LOAD32 and STORE32 classes in BPFInstrInfo.td, to make this
easier.

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/


Patch is 21.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108636.diff

14 Files Affected:

  • (modified) clang/lib/Basic/Targets/BPF.cpp (+7-2)
  • (modified) clang/lib/Basic/Targets/BPF.h (+1-1)
  • (modified) clang/test/Misc/target-invalid-cpu-note/bpf.c (+1)
  • (modified) llvm/lib/Object/ELFObjectFile.cpp (+1-1)
  • (modified) llvm/lib/Target/BPF/AsmParser/BPFAsmParser.cpp (+2)
  • (modified) llvm/lib/Target/BPF/BPF.td (+1)
  • (modified) llvm/lib/Target/BPF/BPFInstrFormats.td (+2)
  • (modified) llvm/lib/Target/BPF/BPFInstrInfo.td (+69-18)
  • (modified) llvm/lib/Target/BPF/BPFMISimplifyPatchable.cpp (+8-4)
  • (modified) llvm/lib/Target/BPF/BPFSubtarget.cpp (+21-12)
  • (modified) llvm/lib/Target/BPF/BPFSubtarget.h (+5)
  • (modified) llvm/lib/Target/BPF/Disassembler/BPFDisassembler.cpp (+5-2)
  • (added) llvm/test/CodeGen/BPF/acquire-release.ll (+95)
  • (added) llvm/test/CodeGen/BPF/assembler-disassembler-v5.s (+22)
diff --git a/clang/lib/Basic/Targets/BPF.cpp b/clang/lib/Basic/Targets/BPF.cpp
index a94ceee5a6a5e7..7a381e4853093a 100644
--- a/clang/lib/Basic/Targets/BPF.cpp
+++ b/clang/lib/Basic/Targets/BPF.cpp
@@ -67,10 +67,15 @@ void BPFTargetInfo::getTargetDefines(const LangOptions &Opts,
     Builder.defineMacro("__BPF_FEATURE_GOTOL");
     Builder.defineMacro("__BPF_FEATURE_ST");
   }
+
+  if (CpuVerNum >= 5) {
+    Builder.defineMacro("__BPF_FEATURE_LOAD_ACQUIRE");
+    Builder.defineMacro("__BPF_FEATURE_STORE_RELEASE");
+  }
 }
 
-static constexpr llvm::StringLiteral ValidCPUNames[] = {"generic", "v1", "v2",
-                                                        "v3", "v4", "probe"};
+static constexpr llvm::StringLiteral ValidCPUNames[] = {
+    "generic", "v1", "v2", "v3", "v4", "v5", "probe"};
 
 bool BPFTargetInfo::isValidCPUName(StringRef Name) const {
   return llvm::is_contained(ValidCPUNames, Name);
diff --git a/clang/lib/Basic/Targets/BPF.h b/clang/lib/Basic/Targets/BPF.h
index d19b37dd4df7a7..3ca9c07f955f9b 100644
--- a/clang/lib/Basic/Targets/BPF.h
+++ b/clang/lib/Basic/Targets/BPF.h
@@ -106,7 +106,7 @@ class LLVM_LIBRARY_VISIBILITY BPFTargetInfo : public TargetInfo {
   void fillValidCPUList(SmallVectorImpl<StringRef> &Values) const override;
 
   bool setCPU(const std::string &Name) override {
-    if (Name == "v3" || Name == "v4") {
+    if (Name == "v3" || Name == "v4" || Name == "v5") {
       HasAlu32 = true;
     }
 
diff --git a/clang/test/Misc/target-invalid-cpu-note/bpf.c b/clang/test/Misc/target-invalid-cpu-note/bpf.c
index fe925f86cdd137..7f90e64ab16f16 100644
--- a/clang/test/Misc/target-invalid-cpu-note/bpf.c
+++ b/clang/test/Misc/target-invalid-cpu-note/bpf.c
@@ -10,6 +10,7 @@
 // CHECK-SAME: {{^}}, v2
 // CHECK-SAME: {{^}}, v3
 // CHECK-SAME: {{^}}, v4
+// CHECK-SAME: {{^}}, v5
 // CHECK-SAME: {{^}}, probe
 // CHECK-SAME: {{$}}
 
diff --git a/llvm/lib/Object/ELFObjectFile.cpp b/llvm/lib/Object/ELFObjectFile.cpp
index f79c233d93fe8e..408af398e4ef17 100644
--- a/llvm/lib/Object/ELFObjectFile.cpp
+++ b/llvm/lib/Object/ELFObjectFile.cpp
@@ -442,7 +442,7 @@ std::optional<StringRef> ELFObjectFileBase::tryGetCPUName() const {
   case ELF::EM_PPC64:
     return StringRef("future");
   case ELF::EM_BPF:
-    return StringRef("v4");
+    return StringRef("v5");
   default:
     return std::nullopt;
   }
diff --git a/llvm/lib/Target/BPF/AsmParser/BPFAsmParser.cpp b/llvm/lib/Target/BPF/AsmParser/BPFAsmParser.cpp
index 9672ed009e9be1..4721097b765a08 100644
--- a/llvm/lib/Target/BPF/AsmParser/BPFAsmParser.cpp
+++ b/llvm/lib/Target/BPF/AsmParser/BPFAsmParser.cpp
@@ -237,6 +237,7 @@ struct BPFOperand : public MCParsedAsmOperand {
         .Case("exit", true)
         .Case("lock", true)
         .Case("ld_pseudo", true)
+        .Case("store_release", true)
         .Default(false);
   }
 
@@ -273,6 +274,7 @@ struct BPFOperand : public MCParsedAsmOperand {
         .Case("cmpxchg_64", true)
         .Case("cmpxchg32_32", true)
         .Case("addr_space_cast", true)
+        .Case("load_acquire", true)
         .Default(false);
   }
 };
diff --git a/llvm/lib/Target/BPF/BPF.td b/llvm/lib/Target/BPF/BPF.td
index dff76ca07af511..dd2d4989561bc3 100644
--- a/llvm/lib/Target/BPF/BPF.td
+++ b/llvm/lib/Target/BPF/BPF.td
@@ -32,6 +32,7 @@ def : Proc<"v1", []>;
 def : Proc<"v2", []>;
 def : Proc<"v3", [ALU32]>;
 def : Proc<"v4", [ALU32]>;
+def : Proc<"v5", [ALU32]>;
 def : Proc<"probe", []>;
 
 def BPFInstPrinter : AsmWriter {
diff --git a/llvm/lib/Target/BPF/BPFInstrFormats.td b/llvm/lib/Target/BPF/BPFInstrFormats.td
index feffdbc69465ea..cb68b0d1250e6e 100644
--- a/llvm/lib/Target/BPF/BPFInstrFormats.td
+++ b/llvm/lib/Target/BPF/BPFInstrFormats.td
@@ -94,6 +94,8 @@ def BPF_IND  : BPFModeModifer<0x2>;
 def BPF_MEM  : BPFModeModifer<0x3>;
 def BPF_MEMSX  : BPFModeModifer<0x4>;
 def BPF_ATOMIC : BPFModeModifer<0x6>;
+def BPF_MEMACQ : BPFModeModifer<0x7>;
+def BPF_MEMREL : BPFModeModifer<0x7>;
 
 class BPFAtomicFlag<bits<4> val> {
   bits<4> Value = val;
diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
index f7e17901c7ed5e..79de3e77b2b123 100644
--- a/llvm/lib/Target/BPF/BPFInstrInfo.td
+++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
@@ -60,6 +60,8 @@ def BPFHasSdivSmod : Predicate<"Subtarget->hasSdivSmod()">;
 def BPFNoMovsx : Predicate<"!Subtarget->hasMovsx()">;
 def BPFNoBswap : Predicate<"!Subtarget->hasBswap()">;
 def BPFHasStoreImm : Predicate<"Subtarget->hasStoreImm()">;
+def BPFHasLoadAcquire : Predicate<"Subtarget->hasLoadAcquire()">;
+def BPFHasStoreRelease : Predicate<"Subtarget->hasStoreRelease()">;
 
 class ImmediateAsmOperand<string name> : AsmOperandClass {
   let Name = name;
@@ -497,12 +499,11 @@ def LD_pseudo
 }
 
 // STORE instructions
-class STORE<BPFWidthModifer SizeOp, string OpcodeStr, list<dag> Pattern>
-    : TYPE_LD_ST<BPF_MEM.Value, SizeOp.Value,
+class STORE<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string AsmString, list<dag> Pattern>
+    : TYPE_LD_ST<ModOp.Value, SizeOp.Value,
                  (outs),
                  (ins GPR:$src, MEMri:$addr),
-                 "*("#OpcodeStr#" *)($addr) = $src",
-                 Pattern> {
+                 AsmString, Pattern> {
   bits<4> src;
   bits<20> addr;
 
@@ -513,7 +514,11 @@ class STORE<BPFWidthModifer SizeOp, string OpcodeStr, list<dag> Pattern>
 }
 
 class STOREi64<BPFWidthModifer Opc, string OpcodeStr, PatFrag OpNode>
-    : STORE<Opc, OpcodeStr, [(OpNode GPR:$src, ADDRri:$addr)]>;
+    : STORE<Opc, BPF_MEM, "*("#OpcodeStr#" *)($addr) = $src", [(OpNode GPR:$src, ADDRri:$addr)]>;
+
+class STORE_RELEASEi64<BPFWidthModifer Opc, string OpcodeStr, PatFrag OpNode>
+    : STORE<Opc, BPF_MEMREL, "store_release(("#OpcodeStr#" *)($addr), $src)",
+            [(OpNode GPR:$src, ADDRri:$addr)]>;
 
 let Predicates = [BPFNoALU32] in {
   def STW : STOREi64<BPF_W, "u32", truncstorei32>;
@@ -522,6 +527,16 @@ let Predicates = [BPFNoALU32] in {
 }
 def STD : STOREi64<BPF_DW, "u64", store>;
 
+class releasing_store<PatFrag base>
+  : PatFrag<(ops node:$val, node:$ptr), (base node:$val, node:$ptr)> {
+  let IsAtomic = 1;
+  let IsAtomicOrderingRelease = 1;
+}
+
+let Predicates = [BPFHasStoreRelease] in {
+  def STDREL : STORE_RELEASEi64<BPF_DW, "u64", releasing_store<atomic_store_64>>;
+}
+
 class STORE_imm<BPFWidthModifer SizeOp,
                 string OpcodeStr, dag Pattern>
     : TYPE_LD_ST<BPF_MEM.Value, SizeOp.Value,
@@ -567,12 +582,11 @@ let Predicates = [BPFHasALU32, BPFHasStoreImm] in {
 }
 
 // LOAD instructions
-class LOAD<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr, list<dag> Pattern>
+class LOAD<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string AsmString, list<dag> Pattern>
     : TYPE_LD_ST<ModOp.Value, SizeOp.Value,
                  (outs GPR:$dst),
                  (ins MEMri:$addr),
-                 "$dst = *("#OpcodeStr#" *)($addr)",
-                 Pattern> {
+                 AsmString, Pattern> {
   bits<4> dst;
   bits<20> addr;
 
@@ -583,7 +597,13 @@ class LOAD<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr, list<
 }
 
 class LOADi64<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr, PatFrag OpNode>
-    : LOAD<SizeOp, ModOp, OpcodeStr, [(set i64:$dst, (OpNode ADDRri:$addr))]>;
+    : LOAD<SizeOp, ModOp, "$dst = *("#OpcodeStr#" *)($addr)",
+           [(set i64:$dst, (OpNode ADDRri:$addr))]>;
+
+class LOAD_ACQUIREi64<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr,
+                      PatFrag OpNode>
+    : LOAD<SizeOp, ModOp, "$dst = load_acquire(("#OpcodeStr#" *)($addr))",
+           [(set i64:$dst, (OpNode ADDRri:$addr))]>;
 
 let isCodeGenOnly = 1 in {
   class CORE_LD<RegisterClass RegClass, string Sz>
@@ -622,6 +642,16 @@ let Predicates = [BPFHasLdsx] in {
 
 def LDD : LOADi64<BPF_DW, BPF_MEM, "u64", load>;
 
+class acquiring_load<PatFrags base>
+    : PatFrag<(ops node:$ptr), (base node:$ptr)> {
+  let IsAtomic = 1;
+  let IsAtomicOrderingAcquire = 1;
+}
+
+let Predicates = [BPFHasLoadAcquire] in {
+  def LDDACQ : LOAD_ACQUIREi64<BPF_DW, BPF_MEMACQ, "u64", acquiring_load<atomic_load_64>>;
+}
+
 class BRANCH<BPFJumpOp Opc, string OpcodeStr, list<dag> Pattern>
     : TYPE_ALU_JMP<Opc.Value, BPF_K.Value,
                    (outs),
@@ -1069,12 +1099,11 @@ def : Pat<(i32 (trunc GPR:$src)),
 def : Pat<(i64 (anyext GPR32:$src)),
           (INSERT_SUBREG (i64 (IMPLICIT_DEF)), GPR32:$src, sub_32)>;
 
-class STORE32<BPFWidthModifer SizeOp, string OpcodeStr, list<dag> Pattern>
-    : TYPE_LD_ST<BPF_MEM.Value, SizeOp.Value,
+class STORE32<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string AsmString, list<dag> Pattern>
+    : TYPE_LD_ST<ModOp.Value, SizeOp.Value,
                  (outs),
                  (ins GPR32:$src, MEMri:$addr),
-                 "*("#OpcodeStr#" *)($addr) = $src",
-                 Pattern> {
+                 AsmString, Pattern> {
   bits<4> src;
   bits<20> addr;
 
@@ -1085,20 +1114,30 @@ class STORE32<BPFWidthModifer SizeOp, string OpcodeStr, list<dag> Pattern>
 }
 
 class STOREi32<BPFWidthModifer Opc, string OpcodeStr, PatFrag OpNode>
-    : STORE32<Opc, OpcodeStr, [(OpNode GPR32:$src, ADDRri:$addr)]>;
+    : STORE32<Opc, BPF_MEM, "*("#OpcodeStr#" *)($addr) = $src",
+              [(OpNode GPR32:$src, ADDRri:$addr)]>;
+
+class STORE_RELEASEi32<BPFWidthModifer Opc, string OpcodeStr, PatFrag OpNode>
+    : STORE32<Opc, BPF_MEMREL, "store_release(("#OpcodeStr#" *)($addr), $src)",
+              [(OpNode GPR32:$src, ADDRri:$addr)]>;
 
 let Predicates = [BPFHasALU32], DecoderNamespace = "BPFALU32" in {
   def STW32 : STOREi32<BPF_W, "u32", store>;
   def STH32 : STOREi32<BPF_H, "u16", truncstorei16>;
   def STB32 : STOREi32<BPF_B, "u8", truncstorei8>;
+
+  let Predicates = [BPFHasStoreRelease] in {
+    def STWREL32 : STORE_RELEASEi32<BPF_W, "u32", releasing_store<atomic_store_32>>;
+    def STHREL32 : STORE_RELEASEi32<BPF_H, "u16", releasing_store<atomic_store_16>>;
+    def STBREL32 : STORE_RELEASEi32<BPF_B, "u8", releasing_store<atomic_store_8>>;
+  }
 }
 
-class LOAD32<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr, list<dag> Pattern>
+class LOAD32<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string AsmString, list<dag> Pattern>
     : TYPE_LD_ST<ModOp.Value, SizeOp.Value,
                 (outs GPR32:$dst),
                 (ins MEMri:$addr),
-                "$dst = *("#OpcodeStr#" *)($addr)",
-                Pattern> {
+                AsmString, Pattern> {
   bits<4> dst;
   bits<20> addr;
 
@@ -1109,12 +1148,24 @@ class LOAD32<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr, lis
 }
 
 class LOADi32<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr, PatFrag OpNode>
-    : LOAD32<SizeOp, ModOp, OpcodeStr, [(set i32:$dst, (OpNode ADDRri:$addr))]>;
+    : LOAD32<SizeOp, ModOp, "$dst = *("#OpcodeStr#" *)($addr)",
+             [(set i32:$dst, (OpNode ADDRri:$addr))]>;
+
+class LOAD_ACQUIREi32<BPFWidthModifer SizeOp, BPFModeModifer ModOp, string OpcodeStr,
+                      PatFrag OpNode>
+    : LOAD32<SizeOp, ModOp, "$dst = load_acquire(("#OpcodeStr#" *)($addr))",
+             [(set i32:$dst, (OpNode ADDRri:$addr))]>;
 
 let Predicates = [BPFHasALU32], DecoderNamespace = "BPFALU32" in {
   def LDW32 : LOADi32<BPF_W, BPF_MEM, "u32", load>;
   def LDH32 : LOADi32<BPF_H, BPF_MEM, "u16", zextloadi16>;
   def LDB32 : LOADi32<BPF_B, BPF_MEM, "u8", zextloadi8>;
+
+  let Predicates = [BPFHasLoadAcquire] in {
+    def LDWACQ32 : LOAD_ACQUIREi32<BPF_W, BPF_MEMACQ, "u32", acquiring_load<atomic_load_32>>;
+    def LDHACQ32 : LOAD_ACQUIREi32<BPF_H, BPF_MEMACQ, "u16", acquiring_load<atomic_load_az_16>>;
+    def LDBACQ32 : LOAD_ACQUIREi32<BPF_B, BPF_MEMACQ, "u8", acquiring_load<atomic_load_az_8>>;
+  }
 }
 
 let Predicates = [BPFHasALU32] in {
diff --git a/llvm/lib/Target/BPF/BPFMISimplifyPatchable.cpp b/llvm/lib/Target/BPF/BPFMISimplifyPatchable.cpp
index 39390e8c38f8c1..0763550cb03419 100644
--- a/llvm/lib/Target/BPF/BPFMISimplifyPatchable.cpp
+++ b/llvm/lib/Target/BPF/BPFMISimplifyPatchable.cpp
@@ -100,21 +100,25 @@ static bool isST(unsigned Opcode) {
 }
 
 static bool isSTX32(unsigned Opcode) {
-  return Opcode == BPF::STB32 || Opcode == BPF::STH32 || Opcode == BPF::STW32;
+  return Opcode == BPF::STB32 || Opcode == BPF::STH32 || Opcode == BPF::STW32 ||
+         Opcode == BPF::STBREL32 || Opcode == BPF::STHREL32 ||
+         Opcode == BPF::STWREL32;
 }
 
 static bool isSTX64(unsigned Opcode) {
   return Opcode == BPF::STB || Opcode == BPF::STH || Opcode == BPF::STW ||
-         Opcode == BPF::STD;
+         Opcode == BPF::STD || Opcode == BPF::STDREL;
 }
 
 static bool isLDX32(unsigned Opcode) {
-  return Opcode == BPF::LDB32 || Opcode == BPF::LDH32 || Opcode == BPF::LDW32;
+  return Opcode == BPF::LDB32 || Opcode == BPF::LDH32 || Opcode == BPF::LDW32 ||
+         Opcode == BPF::LDBACQ32 || Opcode == BPF::LDHACQ32 ||
+         Opcode == BPF::LDWACQ32;
 }
 
 static bool isLDX64(unsigned Opcode) {
   return Opcode == BPF::LDB || Opcode == BPF::LDH || Opcode == BPF::LDW ||
-         Opcode == BPF::LDD;
+         Opcode == BPF::LDD || Opcode == BPF::LDDACQ;
 }
 
 static bool isLDSX(unsigned Opcode) {
diff --git a/llvm/lib/Target/BPF/BPFSubtarget.cpp b/llvm/lib/Target/BPF/BPFSubtarget.cpp
index 305e9a2bf2cda3..a52c39efccedcb 100644
--- a/llvm/lib/Target/BPF/BPFSubtarget.cpp
+++ b/llvm/lib/Target/BPF/BPFSubtarget.cpp
@@ -40,6 +40,12 @@ static cl::opt<bool> Disable_gotol("disable-gotol", cl::Hidden, cl::init(false),
 static cl::opt<bool>
     Disable_StoreImm("disable-storeimm", cl::Hidden, cl::init(false),
                      cl::desc("Disable BPF_ST (immediate store) insn"));
+static cl::opt<bool>
+    Disable_load_acquire("disable-load-acquire", cl::Hidden, cl::init(false),
+                         cl::desc("Disable load-acquire insns"));
+static cl::opt<bool>
+    Disable_store_release("disable-store-release", cl::Hidden, cl::init(false),
+                          cl::desc("Disable store-release insns"));
 
 void BPFSubtarget::anchor() {}
 
@@ -62,6 +68,8 @@ void BPFSubtarget::initializeEnvironment() {
   HasSdivSmod = false;
   HasGotol = false;
   HasStoreImm = false;
+  HasLoadAcquire = false;
+  HasStoreRelease = false;
 }
 
 void BPFSubtarget::initSubtargetFeatures(StringRef CPU, StringRef FS) {
@@ -69,29 +77,30 @@ void BPFSubtarget::initSubtargetFeatures(StringRef CPU, StringRef FS) {
     CPU = "v3";
   if (CPU == "probe")
     CPU = sys::detail::getHostCPUNameForBPF();
-  if (CPU == "generic" || CPU == "v1")
-    return;
-  if (CPU == "v2") {
-    HasJmpExt = true;
+  if (CPU.empty() || CPU == "generic" || CPU == "v1")
     return;
-  }
-  if (CPU == "v3") {
+
+  int CpuVerNum = CPU.back() - '0';
+  if (CpuVerNum >= 2)
     HasJmpExt = true;
+
+  if (CpuVerNum >= 3) {
     HasJmp32 = true;
     HasAlu32 = true;
-    return;
   }
-  if (CPU == "v4") {
-    HasJmpExt = true;
-    HasJmp32 = true;
-    HasAlu32 = true;
+
+  if (CpuVerNum >= 4) {
     HasLdsx = !Disable_ldsx;
     HasMovsx = !Disable_movsx;
     HasBswap = !Disable_bswap;
     HasSdivSmod = !Disable_sdiv_smod;
     HasGotol = !Disable_gotol;
     HasStoreImm = !Disable_StoreImm;
-    return;
+  }
+
+  if (CpuVerNum >= 5) {
+    HasLoadAcquire = !Disable_load_acquire;
+    HasStoreRelease = !Disable_store_release;
   }
 }
 
diff --git a/llvm/lib/Target/BPF/BPFSubtarget.h b/llvm/lib/Target/BPF/BPFSubtarget.h
index 33747546eadc3b..aa7995c3af5ecf 100644
--- a/llvm/lib/Target/BPF/BPFSubtarget.h
+++ b/llvm/lib/Target/BPF/BPFSubtarget.h
@@ -66,6 +66,9 @@ class BPFSubtarget : public BPFGenSubtargetInfo {
   // whether cpu v4 insns are enabled.
   bool HasLdsx, HasMovsx, HasBswap, HasSdivSmod, HasGotol, HasStoreImm;
 
+  // whether cpu v5 insns are enabled.
+  bool HasLoadAcquire, HasStoreRelease;
+
   std::unique_ptr<CallLowering> CallLoweringInfo;
   std::unique_ptr<InstructionSelector> InstSelector;
   std::unique_ptr<LegalizerInfo> Legalizer;
@@ -92,6 +95,8 @@ class BPFSubtarget : public BPFGenSubtargetInfo {
   bool hasSdivSmod() const { return HasSdivSmod; }
   bool hasGotol() const { return HasGotol; }
   bool hasStoreImm() const { return HasStoreImm; }
+  bool hasLoadAcquire() const { return HasLoadAcquire; }
+  bool hasStoreRelease() const { return HasStoreRelease; }
 
   bool isLittleEndian() const { return IsLittleEndian; }
 
diff --git a/llvm/lib/Target/BPF/Disassembler/BPFDisassembler.cpp b/llvm/lib/Target/BPF/Disassembler/BPFDisassembler.cpp
index 536bee5393843a..d1b9769d3bed96 100644
--- a/llvm/lib/Target/BPF/Disassembler/BPFDisassembler.cpp
+++ b/llvm/lib/Target/BPF/Disassembler/BPFDisassembler.cpp
@@ -58,7 +58,9 @@ class BPFDisassembler : public MCDisassembler {
     BPF_IND = 0x2,
     BPF_MEM = 0x3,
     BPF_MEMSX = 0x4,
-    BPF_ATOMIC = 0x6
+    BPF_ATOMIC = 0x6,
+    BPF_MEMACQ = 0x7,
+    BPF_MEMREL = 0x7
   };
 
   BPFDisassembler(const MCSubtargetInfo &STI, MCContext &Ctx)
@@ -177,7 +179,8 @@ DecodeStatus BPFDisassembler::getInstruction(MCInst &Instr, uint64_t &Size,
   uint8_t InstMode = getInstMode(Insn);
   if ((InstClass == BPF_LDX || InstClass == BPF_STX) &&
       getInstSize(Insn) != BPF_DW &&
-      (InstMode == BPF_MEM || InstMode == BPF_ATOMIC) &&
+      (InstMode == BPF_MEM || InstMode == BPF_ATOMIC ||
+       InstMode == BPF_MEMACQ /* or BPF_MEMREL */) &&
       STI.hasFeature(BPF::ALU32))
     Result = decodeInstruction(DecoderTableBPFALU3264, Instr, Insn, Address,
                                this, STI);
diff --git a/llvm/test/CodeGen/BPF/acquire-release.ll b/llvm/test/CodeGen/BPF/acquire-release.ll
new file mode 100644
index 00000000000000..1c7db417b6c43f
--- /dev/null
+++ b/llvm/test/CodeGen/BPF/acquire-release.ll
@@ -0,0 +1,95 @@
+; RUN: llc < %s -march=bpfel -mcpu=v5 -verify-machineinstrs -show-mc-encoding | FileCheck %s
+;
+; Source:
+;   char load_acquire_i8(char *p) {
+;     return __atomic_load_n(p, __ATOMIC_ACQUIRE);
+;   }
+;   short load_acquire_i16(short *p) {
+;     return __atomic_load_n(p, __ATOMIC_ACQUIRE);
+;   }
+;   int load_acquire_i32(int *p) {
+;     return __atomic_load_n(p, __ATOMIC_ACQUIRE);
+;   }
+;   long load_acquire_i64(long *p) {
+;     return __atomic_load_n(p, __ATOMIC_ACQUIRE);
+;   }
+;   void store_release_i8(char *p, char v) {
+;     __atomic_store_n(p, v, __ATOMIC_RELEASE);
+;   }
+;   void store_release_i16(short *p, short v) {
+;     __atomic_store_n(p, v, __ATOMIC_RELEASE);
+;   }
+;   void store_release_i32(int *p, int v) {
+;     __atomic_store_n(p, v, __ATOMIC_RELEASE);
+;   }
+;   void store_release_i64(long *p, long v) {
+;     __atomic_store_n(p, v, __ATOMIC_RELEASE);
+;   }
+
+; CHECK-LABEL: load_acquire_i8
+; CHECK: w0 = load_acquire((u8 *)(r1 + 0)) # encoding: [0xf1,0x10,0x00,0x00,0x00,0x00,0x00,0x00]
+define dso_local i8 @load_acquire_i8(ptr nocapture noundef readonly %p) local_unnamed_addr {
+entry:
+  %0 = load atomic i8, ptr %p acquire, align 1
+  ret i8 %0
+}
+
+; CHECK-LABEL: load_acquire_i16
+; CHECK: w0 = load_acquire((u16 *)(r1 + 0)) # encoding: [0xe9,0x10,0x00,0x00,0x00,0x00,0x00,0x00]
+define dso_local i16 @load_acquire_i16(ptr nocapture noundef readonly %p) local_unnamed_addr {
+entry:
+  %0 = load atomic i16, ptr %p acquire, align 2
+  ret i16 %0
+}
+
+; CHECK-LABEL: load_acquire_i32
+; CHECK: w0 = load_acquire((u32 *)(r1 + 0)) # encoding: [0xe1,0x10,0x00,0x00,0x00,0x00,0x00,0x00]
+define dso_local i32 @load_acquire_i32(ptr nocapture noundef readonly %p) local_unnamed_addr {
+entry:
+  %0 = load atomic i32, ptr %p acquire, align 4
+  ret i32 %0
+}
+
+; CHECK-LABEL: load_acquire_i64
+; CHECK: r0 = load_acquire((u64 *)(r1 + 0)) # encoding: [0xf9,0x10,0x00,0x00,0x00,0x00,0x00,0x00]
+define dso_local i64 @load_acquire_i64(ptr nocapture noundef readonly %p) local_unnamed_addr {
+entry:
+  %0 = load atomic i64, ptr %p acquire, align 8
+  ret i64 %0
+}
+
+; CHECK-LABEL: store_release_i8
+; CHECK: store_release((u8 *)(r1 + 0), w2) # encoding: [0xf3,0x21,0x00,0x00,0x00,0x00,0x00,0x00]
+define void @store_release_i8(ptr nocapture noundef writeonly %p,
+                              i8 noundef signext %v) local_unnamed_addr {
+entry:
+  store atomic i8 %v, ptr %p release, align 1
+  ret void
+}
+
+; CHECK-LABEL: store_release_i16
+; CHECK: store_release((u16 *)(r1 + 0), w2) # encoding: [0xeb,0x21,0x00,0x00,0x00,0x00,0x00,0x00]
+define void @store_release_i16(ptr nocapture noundef writeonly %p,
+                               i16 noundef signext %v) local_unnamed_addr ...
[truncated]

Contributor

@eddyz87 eddyz87 left a comment


I think this looks fine, but the syntax seems to add to the zoo we already have.
E.g. there are already instructions like lock *(u32 *)(r1 + 0x0) &= w2, and we treat these as having monotonic / memory_order_relaxed semantics (see #107343).
It seems to me, at least, that it would be more logical to build on top of the existing syntax, e.g.:

lock *(u64 *)(r1 + 0x0) = r2 release
lock r2 = *(u64 *)(r1 + 0x0) acquire

Also a question regarding 3 commits in one pull request.
As far as I understand current policy, the idea is to have a separate pull request for each commit (and fork branches from one another to build a stack). Otherwise the commits are squashed. What's the idea with current pull request?

@peilin-ye
Contributor Author

Hi @eddyz87, thanks for the review and context!

lock *(u64 *)(r1 + 0x0) = r2 release
lock r2 = *(u64 *)(r1 + 0x0) acquire

Appending acquire and release does sound nice to me, since it makes the syntax more similar to LLVM IR (e.g. store atomic i64 %v, ptr %p release, align 8), and it would also make it easier to support other memory ordering types in the future; e.g., we could simply append seq_cst instead of having to say load_seq_cst(...).

However, I didn't want to use lock because I feel like it's too similar to the x86 LOCK prefix (implies a full memory barrier, which could be confusing here). WDYT? Cc: @yonghong-song @4ast

Also a question regarding 3 commits in one pull request. As far as I understand current policy, the idea is to have a separate pull request for each commit (and fork branches from one another to build a stack). Otherwise the commits are squashed. What's the idea with current pull request?

Got it, I'll split this into separate PRs later.

@eddyz87
Contributor

eddyz87 commented Sep 18, 2024

@peilin-ye ,

However, I didn't want to use lock because I feel like it's too similar to the x86 LOCK prefix (implies a full memory barrier, which could be confusing here). WDYT? Cc: @yonghong-song @4ast

Well, we already confuse people in a way, since the existing lock *(u64 *)(r1 + 0x0) += r2 does not imply a full barrier (it is translated to stadd by the arm64 JIT in the kernel, which is documented as "has no memory ordering semantics"). So lock ... {acquire,release} shouldn't make things worse.

@eddyz87
Contributor

eddyz87 commented Sep 18, 2024

Also a question regarding 3 commits in one pull request. As far as I understand current policy, the idea is to have a separate pull request for each commit (and fork branches from one another to build a stack). Otherwise the commits are squashed. What's the idea with current pull request?

Got it, I'll split this into separate PRs later.

(Not that I like this policy, but that's the way LLVM team decided to use github 🤷‍♂️)

@4ast
Member

4ast commented Sep 19, 2024

lock *(u64 *)(r1 + 0x0) = r2 release
lock r2 = *(u64 *)(r1 + 0x0) acquire

tbh I don't like such syntax. It's harder to read compared to what the patch does:
r0 = load_acquire((u64 *)(r1 + 0))
store_release((u8 *)(r1 + 0x0), w2)

"lock" part doesn't fit here either. "lock" is x86 specific.

@peilin-ye
Contributor Author

(pushed v2 to resolve the easier issues first :-))

However, I couldn't find a way to generate a better error message for __ATOMIC_SEQ_CST; it seems that CodeGen simply calls CannotYetSelect() if nothing in the MatcherTable matches. Any suggestions?

@eddyz87
Contributor

eddyz87 commented Sep 20, 2024

@peilin-ye,

I couldn't find a way to generate a better error message for __ATOMIC_SEQ_CST, however; it seems that CodeGen simply calls CannotYetSelect() if nothing in MatcherTable matches. Any suggestions?

There is probably some TableGen incantation to invoke custom C++ for a matched pattern, but I haven't found it. I also haven't found any target-info hooks for clang to report unsupported orderings.

One option is to extend existing mechanics:

diff --git a/llvm/lib/Target/BPF/BPFISelLowering.cpp b/llvm/lib/Target/BPF/BPFISelLowering.cpp
index ff23d3b055d0..9e54c7f3de65 100644
--- a/llvm/lib/Target/BPF/BPFISelLowering.cpp
+++ b/llvm/lib/Target/BPF/BPFISelLowering.cpp
@@ -91,6 +91,7 @@ BPFTargetLowering::BPFTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ATOMIC_LOAD_XOR, VT, Custom);
     setOperationAction(ISD::ATOMIC_SWAP, VT, Custom);
     setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, VT, Custom);
+    setOperationAction(ISD::ATOMIC_LOAD, VT, Custom);
   }
 
   for (auto VT : { MVT::i32, MVT::i64 }) {
@@ -291,6 +292,14 @@ void BPFTargetLowering::ReplaceNodeResults(
     else
       Msg = "unsupported atomic operation, please use 64 bit version";
     break;
+  case ISD::ATOMIC_LOAD: {
+    auto *AtomicLoad = cast<AtomicSDNode>(N);
+    if (AtomicLoad->getMergedOrdering() != AtomicOrdering::AcquireRelease &&
+        AtomicLoad->getMergedOrdering() != AtomicOrdering::SequentiallyConsistent)
+      return;
+    Msg = "a useful message about unsupported ordering here";
+    break;
+    }
   }
 
   SDLoc DL(N);

When compiled with -g, this reports the location and a message before complaining about the selection error. But there should be a way to tweak the existing fail function to stop after errors are reported.

@peilin-ye
Copy link
Contributor Author

peilin-ye commented Sep 21, 2024

@eddyz87,

Thanks! I didn't know about XXXISelLowering.cpp.

But there should be a way to tweak existing fail function to stop after errors are reported.

Can we exit(1)? :-)

fail() calls LLVMContext::diagnose(), which already calls exit(1) when there's no "report handler", if the "severity" is DS_Error:

  if (DI.getSeverity() == DS_Error)
    exit(1);
}

fail() uses DiagnosticInfoUnsupported, whose "severity" _is_ DS_Error, but our "report handler" (pImpl->DiagHandler->handleDiagnostics()) doesn't call exit() ...


I tried, based on your diff (__ATOMIC_ACQ_REL is illegal for __atomic_{load,store}{,_n}(), so we only need to handle AtomicOrdering::SequentiallyConsistent):

--- a/llvm/lib/Target/BPF/BPFISelLowering.cpp
+++ b/llvm/lib/Target/BPF/BPFISelLowering.cpp
@@ -93,6 +93,9 @@ BPFTargetLowering::BPFTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, VT, Custom);
   }
 
+  for (auto VT : {MVT::i8, MVT::i16, MVT::i32, MVT::i64})
+    setOperationAction(ISD::ATOMIC_LOAD, VT, Custom);
+
   for (auto VT : { MVT::i32, MVT::i64 }) {
     if (VT == MVT::i32 && !STI.getHasAlu32())
       continue;
@@ -291,6 +294,8 @@ void BPFTargetLowering::ReplaceNodeResults(
     else
       Msg = "unsupported atomic operation, please use 64 bit version";
     break;
+  case ISD::ATOMIC_LOAD:
+    return;
   }
 
   SDLoc DL(N);
@@ -316,6 +321,8 @@ SDValue BPFTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
     return LowerSDIVSREM(Op, DAG);
   case ISD::DYNAMIC_STACKALLOC:
     return LowerDYNAMIC_STACKALLOC(Op, DAG);
+  case ISD::ATOMIC_LOAD:
+    return LowerATOMIC_LOAD(Op, DAG);
   }
 }
 
@@ -703,6 +710,22 @@ SDValue BPFTargetLowering::LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getNode(BPFISD::SELECT_CC, DL, VTs, Ops);
 }
 
+SDValue BPFTargetLowering::LowerATOMIC_LOAD(SDValue Op,
+                                            SelectionDAG &DAG) const {
+  const char *Msg =
+      "sequentially consistent (seq_cst) atomic load is not supported";
+  SDNode *N = Op.getNode();
+  SDLoc DL(N);
+
+  if (cast<AtomicSDNode>(N)->getMergedOrdering() ==
+      AtomicOrdering::SequentiallyConsistent) {
+    fail(DL, DAG, Msg);
+    exit(1);
+  }
+
+  return Op;
+}
+
 const char *BPFTargetLowering::getTargetNodeName(unsigned Opcode) const {
   switch ((BPFISD::NodeType)Opcode) {
   case BPFISD::FIRST_NUMBER:
--- a/llvm/lib/Target/BPF/BPFISelLowering.h
+++ b/llvm/lib/Target/BPF/BPFISelLowering.h
@@ -77,7 +77,7 @@ private:
   SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerBR_CC(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const;
-
+  SDValue LowerATOMIC_LOAD(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerConstantPool(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) const;
 

which seems to work nicely:

$ cat bar.c
char foo(char *ptr) { return __atomic_load_n(ptr, __ATOMIC_SEQ_CST); }
$
$ clang --target=bpf -mcpu=v5 -c bar.c > /dev/null
bar.c:1:6: error: sequentially consistent (seq_cst) atomic load is not supported
    1 | char foo(char *ptr) { return __atomic_load_n(ptr, __ATOMIC_SEQ_CST); }
      |      ^
$

@eddyz87
Copy link
Contributor

eddyz87 commented Sep 24, 2024

@peilin-ye ,

Sorry for the delayed reply.

But there should be a way to tweak existing fail function to stop after errors are reported.

Can we exit(1) ? :-)

Let's not hard-code the exit(1) in the backend. In case LLVM is used as a library, such an exit might be quite unexpected.
I double-checked, and clang's diagnostic handler is installed in clang::BackendConsumer::HandleTranslationUnit() (clang/lib/CodeGen/CodeGenAction.cpp:294); it redirects all handling to BackendConsumer::DiagnosticHandlerImpl(). It does not seem customizable.

I'd say let's go with what you suggest, but avoid the exit(1) call. We need to figure out how to do fail() w/o a backtrace, but that is outside the scope of this pull request.

@peilin-ye
Copy link
Contributor Author

@eddyz87,

Sorry for the delayed reply.

No worries at all!

I'd say let's go with what you suggest, but avoid the exit(1) call. We need to figure out how to do fail() w/o a backtrace, but that is outside the scope of this pull request.

Got it, rebased and added a 4th commit (can't say I like the one-commit-per-PR policy either :-) to do this in v3. Please take another look.


Also, just a heads-up: I think this PR is blocked by an issue that I mentioned earlier on LKML (Cc: @4ast). After that is resolved, I think we should change this PR to use 0x5 for the new mode modifiers (BPF_MEMACQ, BPF_MEMREL).

@eddyz87
Copy link
Contributor

eddyz87 commented Sep 24, 2024

@peilin-ye ,

Got it, rebased and added a 4th commit (can't say I like the one-commit-per-PR policy either :-) to do this in v3. Please take another look.

lgtm, but please note CI failure:

FAIL: LLVM :: CodeGen/BPF/acquire-release.ll (11379 of 55168)
******************** TEST 'LLVM :: CodeGen/BPF/acquire-release.ll' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 1
c:\ws\src\build\bin\llc.exe < C:\ws\src\llvm\test\CodeGen\BPF\acquire-release.ll -march=bpfel -mcpu=v5 -verify-machineinstrs -show-mc-encoding | c:\ws\src\build\bin\filecheck.exe C:\ws\src\llvm\test\CodeGen\BPF\acquire-release.ll
# executed command: 'c:\ws\src\build\bin\llc.exe' -march=bpfel -mcpu=v5 -verify-machineinstrs -show-mc-encoding
# .---command stderr------------
# | Assertion failed: From.getNode() != To.getNode() && "Potential legalization loop!", file C:\ws\src\llvm\lib\CodeGen\SelectionDAG\LegalizeTypes.cpp, line 646
# | PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
# | Stack dump:
# | 0.	Program arguments: c:\\ws\\src\\build\\bin\\llc.exe -march=bpfel -mcpu=v5 -verify-machineinstrs -show-mc-encoding
# | 1.	Running pass 'Function Pass Manager' on module '<stdin>'.
# | 2.	Running pass 'BPF DAG->DAG Pattern Instruction Selection' on function '@store_release_i8'
# | Exception Code: 0x80000003
# |  #0 0x00007ff732c3a705 (c:\ws\src\build\bin\llc.exe+0x2b4a705)

@peilin-ye
Copy link
Contributor Author

lgtm, but please note CI failure:

Oops, I'll take a closer look later. Thanks!

@4ast
Copy link
Member

4ast commented Sep 25, 2024

Also, just a heads-up: I think this PR is blocked by an issue that I mentioned earlier on LKML (Cc: @4ast). After that is resolved, I think we should change this PR to use 0x5 for the new mode modifiers (BPF_MEMACQ, BPF_MEMREL).

This PR needs to land together with the kernel/verifier changes. It's necessary to get the LLVM bits into landable shape first, and
it sounds like they will be there once the CI crash is resolved.

@peilin-ye peilin-ye force-pushed the BPF-load_acquire-store_release branch 2 times, most recently from 06ac94b to e23e67c Compare October 1, 2024 00:32
@peilin-ye
Copy link
Contributor Author

(Back on this; apologies for the delay)

Pushed v4 to fix the acquire-release.ll failure:

Assuming HasAlu32, for ATOMIC_STORE, if operand #1 (the value to store) is an i8 or i16, we need to promote it to i32 in the Legalize phase. Specifically, in BPFTargetLowering::LowerATOMIC_STORE(), return SDValue() if the type of operand #1 is "illegal" (i8 or i16), to tell DAGTypeLegalizer::PromoteIntegerOperand() to promote it to i32.

@peilin-ye
Copy link
Contributor Author

peilin-ye commented Oct 1, 2024

Hi @4ast,

Also, just a heads-up: I think this PR is blocked by an issue that I mentioned earlier on LKML (Cc: @4ast). After that is resolved, I think we should change this PR to use 0x5 for the new mode modifiers (BPF_MEMACQ, BPF_MEMREL).

this PR needs to land together with kernel/verifier changes. It's necessary to get llvm bits in landable shape first and it sounds that it will be there once CI crash is resolved.

Got it! I wanted to get your opinion on the linked issue though:

This PR can use either 5 or 7 for the new mode modifier, according to instruction-set.rst:

    IMM    0
    ABS    1
    IND    2
    MEM    3
    MEMSX  4
    ATOMIC 6

However, right now both 5 and 7 are used by the verifier internally:

        /* unused opcode to mark special load instruction. Same as BPF_MSH */
        #define BPF_PROBE_MEM32 0xa0

(0xa0 == 5 << 5)

        /* unused opcode to mark special atomic instruction */
        #define BPF_PROBE_ATOMIC 0xe0

(0xe0 == 7 << 5)

Can I move them elsewhere to make room for this PR? Like, can I make them use the unused higher bits of the imm field (which seems pretty empty for LD, LDX, ST, STX instructions) instead?
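As a quick sanity check of that collision (a sketch, not verifier code; the constant values are the kernel defines quoted above), the mode modifier occupies the top three bits of the opcode byte:

```python
# Sketch: the BPF opcode byte packs the mode modifier in its top 3 bits.
# Constant values mirror the kernel defines quoted above.
BPF_PROBE_MEM32 = 0xA0   # used internally by the verifier
BPF_PROBE_ATOMIC = 0xE0  # used internally by the verifier

def mode(opcode: int) -> int:
    """Extract the 3-bit mode modifier (bits 5-7) from an opcode byte."""
    return (opcode >> 5) & 0x7

assert mode(BPF_PROBE_MEM32) == 5   # so mode 5 is taken internally
assert mode(BPF_PROBE_ATOMIC) == 7  # ...and so is mode 7
```

This is why both candidate mode values (5 and 7) conflict with the verifier's internal opcodes.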

@yonghong-song
Copy link
Contributor

@peilin-ye Sorry for late review. A few general observations below:

  1. I think -mcpu=v5 can be avoided. We really want to accumulate quite a few insns before increasing the cpu version. Otherwise, we may quickly reach v7/v8/..., and people will lose track of which version supports what. We might need to use feature flags later (e.g. for x86: -msse, -mavx, etc.). So far, I suggest sticking to cpu v4.

  2. I tried below example:

$ cat t4.c
short foo(short *ptr) {
  return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
}
$ clang --target=bpf -g -c -O2 t4.c -mcpu=v5
$ llvm-objdump -d t4.o

t4.o:   file format elf64-bpf

Disassembly of section .text:

0000000000000000 <foo>:
       0:       e9 10 00 00 00 00 00 00 w0 = load_acquire((u16 *)(r1 + 0x0))
       1:       95 00 00 00 00 00 00 00 exit

From the source code, the operation should return a 'short' type value. But since the result will be a 32-bit value, should we do sign-extension here, i.e. w0 = load_acquire((s16 *)(r1 + 0x0))?

  3. Currently you have:

    def BPF_ATOMIC : BPFModeModifer<0x6>;
    +def BPF_MEMACQ : BPFModeModifer<0x7>;
    +def BPF_MEMREL : BPFModeModifer<0x7>;

Since load_acquire and store_release are essentially atomic operations (e.g. __atomic_load_n() and __atomic_store_n()), is it possible to add acquire/release support in BPF_ATOMIC? We might want to reserve BPFModeModifer 7 for future use.

@peilin-ye
Copy link
Contributor Author

Hi @yonghong-song,

No worries, and thanks for taking a look at this!

So far, I suggest to stick to cpu v4.

I see, I can do that.

is it possible to add acquire/release support in BPF_ATOMIC? We might want to reserve BPFModeModifer 7 for future use.

Sounds good, that way we also avoid the "5 and 7 are already internally used by the verifier" problem mentioned earlier. I'll try adding one or more new flags to the imm field.

should we do sign-extension here? w0 = load_acquire((s16 *)(r1 + 0x0))?

Not sure about this one. I'll learn about MEMSX and look at other backends.

@4ast
Copy link
Member

4ast commented Oct 2, 2024

Since load_acquire and store_release are essentially atomic operations (e.g. __atomic_load_n() and __atomic_store_n()), is it possible to add acquire/release support in BPF_ATOMIC?

We discussed it with Eduard as well and came to similar conclusion.
BPF_STX | BPF_ATOMIC and imm = LOAD_ACQ or imm = STORE_REL
would fit better than using ModeModifier for these two new insns.

The verifier is using Mode to express either arena or probe-ing needs.
Like BPF_STX | BPF_ATOMIC is converted to BPF_STX | BPF_PROBE_ATOMIC by the verifier
and JITs that don't support atomic insns in arena will complain about such insns,
because such opcode will not be found in the big switch() statement.

New load_acq/store_rel insn would need to be supported in arena.
If we use Mode for them it will be hard to convey the change to JITs.
But if we go with imm = LOAD_ACQ or imm = STORE_REL approach
the BPF_ATOMIC -> BPF_PROBE_ATOMIC conversion will cover these insns automatically.
We need to tweak bpf_jit_supports_insn(), but internal protocol between the verifier and JITs
will stay the same.

The BPF_STX | BPF_ATOMIC opcode for a load-acquire insn may look a bit odd,
but once you consider that this opcode is already used to return the value for the atomic_fetch_or|add... insns,
it doesn't look as odd.

@peilin-ye peilin-ye force-pushed the BPF-load_acquire-store_release branch from e23e67c to aa15d26 Compare October 8, 2024 01:00
@peilin-ye peilin-ye changed the title [BPF] Add load-acquire and store-release instructions under -mcpu=v5 [BPF] Add load-acquire and store-release instructions under -mcpu=v4 Oct 8, 2024
@peilin-ye
Copy link
Contributor Author

peilin-ye commented Oct 8, 2024

Thanks for all your suggestions! Main changes in patchset v5:

  1. use -mcpu=v4 instead of -mcpu=v5 as suggested by Yonghong
  2. make both load-acquire and store-release BPF_STX | BPF_ATOMIC insns, instead of introducing new mode modifiers:

BPF_STX | BPF_ATOMIC insns use a subset of BPFArithOp<> in bit 4-7 of the imm field. Currently these are in use:

def BPF_ADD  : BPFArithOp<0x0>;
def BPF_OR   : BPFArithOp<0x4>;
def BPF_AND  : BPFArithOp<0x5>;
def BPF_XOR  : BPFArithOp<0xa>;
def BPF_XCHG    : BPFArithOp<0xe>;
def BPF_CMPXCHG : BPFArithOp<0xf>;

On top of this, I tentatively chose 0x1 for BPF_LOAD_ACQ, and 0xb for BPF_STORE_REL, because:

  • 0x1 is BPF_SUB in BPFArithOp<>, but atomic SUB is implemented using a NEG + ADD:
    // atomic_load_sub can be represented as a neg followed
    // by an atomic_load_add.
  • 0xb is BPF_MOV in BPFArithOp<>, which is "moving value between registers" and isn't applicable for atomic (memory) operations

Please let me know if you have better suggestions for encoding (or if I'm overthinking this and we can simply use 0x1 and 0x2).
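To make the tentative encoding concrete (a sketch based on the layout described above; BPF_LOAD_ACQ and BPF_STORE_REL are the proposed values from this patchset, not merged constants):

```python
# Like other BPF_STX | BPF_ATOMIC insns, the operation selector
# (a subset of BPFArithOp<>) occupies bits 4-7 of the 32-bit imm field.
BPF_LOAD_ACQ = 0x1   # tentative: overlaps BPF_SUB, which atomics don't use
BPF_STORE_REL = 0xB  # tentative: overlaps BPF_MOV, which atomics don't use

def atomic_op(imm: int) -> int:
    """Extract the BPFArithOp<> selector from an atomic insn's imm field."""
    return (imm >> 4) & 0xF

# The disassembly examples in this thread show imm = 0x10 for a
# load-acquire and imm = 0xb0 for a store-release:
assert atomic_op(0x10) == BPF_LOAD_ACQ
assert atomic_op(0xB0) == BPF_STORE_REL
```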


@eddyz87, do my changes to BPFMISimplifyPatchable.cpp still look good to you? Now that load-acquires are STX insns, I wanted to make sure that BPFMISimplifyPatchable::checkADDrr() can still handle them correctly for CO-RE.

I'll add tests to llvm/test/CodeGen/BPF/CORE/ later, like what you did in commit 08d92de ("[BPF] Fix in/out argument constraints for CORE_MEM instructions").


About whether we should sign-extend for 8- and 16-bit load-acquires (brought up by Yonghong):

All ARM64 insns that match acquiring_load<atomic_load[_az]_{8,16}> seem to zero-extend the value before writing it to register, like, LDAPRH:

Load-Acquire RCpc Register Halfword derives an address from a base register value, loads a halfword from the derived address in memory, zero-extends it and writes it to a register.

So right now I'm keeping our LDBACQ32 and LDHACQ32 to zero-extend. I'll take a look at other archs later.

@yonghong-song
Copy link
Contributor

For

About whether we should sign-extend for 8- and 16-bit load-acquires (brought up by Yonghong):

All ARM64 insns that match acquiring_load<atomic_load[_az]_{8,16}> seem to zero-extend the value before writing it to register, like, LDAPRH:

Load-Acquire RCpc Register Halfword derives an address from a base register value, loads a halfword from the derived address in memory, zero-extends it and writes it to a register.

So right now I'm keeping our LDBACQ32 and LDHACQ32 to zero-extend. I'll take a look at other archs later.

I think you need to implement the kernel part to utilize the newly-introduced BPF insns. Then we can see whether the JIT'ed code works as expected or not.

let Predicates = [BPFHasALU32], DecoderNamespace = "BPFALU32" in {
def LDW32 : LOADi32<BPF_W, BPF_MEM, "u32", load>;
def LDH32 : LOADi32<BPF_H, BPF_MEM, "u16", zextloadi16>;
def LDB32 : LOADi32<BPF_B, BPF_MEM, "u8", zextloadi8>;

let Predicates = [BPFHasLoadAcquire] in {
Copy link
Contributor


Is there a reason to define these (and stores) behind BPFHasALU32 flag?
From the instruction encoding point of view, we use the STX class, which has no 32/64 sub-division.

Copy link
Contributor Author

@peilin-ye peilin-ye Oct 16, 2024


Is there a reason to define these (and stores) behind BPFHasALU32 flag?

Not really. Looks like there is no way to turn off HasALU32 for v4. I just wanted to make it clear in the .td file that we are using half-registers for 8-, 16- and 32-bit insns.

Also I wanted to define them behind DecoderNamespace = "BPFALU32" because currently BPFDisassembler::getInstruction() uses DecoderTableBPFALU3264 for all non-64-bit BPF_STX | BPF_ATOMIC insns:

  if ((InstClass == BPF_LDX || InstClass == BPF_STX) &&
      getInstSize(Insn) != BPF_DW &&
      (InstMode == BPF_MEM || InstMode == BPF_ATOMIC) &&
      STI.hasFeature(BPF::ALU32))
    Result = decodeInstruction(DecoderTableBPFALU3264, Instr, Insn, Address,
                               this, STI);
  else
    Result = decodeInstruction(DecoderTableBPF64, Instr, Insn, Address, this,
                               STI);

So if I move them out of DecoderNamespace = "BPFALU32" and keep BPFDisassembler.cpp as-is, llvm-objdump would give me:

0000000000000000 <bar>:
;     __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
       0:	cb 21 00 00 b0 00 00 00	<unknown>
; }
       1:	95 00 00 00 00 00 00 00	exit

From the instruction encoding point of view, we use the STX class, which has no 32/64 sub-division.

What I'm doing is similar to XXOR*: right now we only have XXORD and XXORW32 (both BPF_STX), and XXORW32 is behind Predicates = [BPFHasALU32], DecoderNamespace = "BPFALU32". There is no "ALU64 version" of XXORW32.

llvm/lib/Target/BPF/BPFInstrInfo.td
@eddyz87
Copy link
Contributor

eddyz87 commented Oct 9, 2024

@eddyz87, do my changes to BPFMISimplifyPatchable.cpp still look good to you? Now that load-acquires are STX insns, I wanted to make sure that BPFMISimplifyPatchable::checkADDrr() can still handle them correctly for CO-RE.

The checkADDrr() change looks ok.

I'll add tests to llvm/test/CodeGen/BPF/CORE/ later, like what you did in commit 08d92de ("[BPF] Fix in/out argument constraints for CORE_MEM instructions").

👍

Also, commit message says:

For example:

long foo(long *ptr) {
    return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
}

foo() can be compiled to:

db 10 00 00 10 00 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))
95 00 00 00 00 00 00 00  exit

opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
imm (0x00000010): BPF_LOAD_ACQ

Do we want to use LDX for this instruction? e.g.:

 opcode (0x3e): BPF_ATOMIC | BPF_DW | BPF_LDX
 imm (0x00000010): BPF_LOAD_ACQ

@4ast
Copy link
Member

4ast commented Oct 10, 2024

BPF_ATOMIC | BPF_DW | BPF_LDX

I'd rather not. See my earlier comment. Existing fetch flavors return the value.

@peilin-ye peilin-ye force-pushed the BPF-load_acquire-store_release branch 2 times, most recently from 9a84ed5 to 7d6abf8 Compare October 17, 2024 00:03
@peilin-ye
Copy link
Contributor Author

Back on this; rebased and pushed patchset v6 to make acquire-release.ll also cover -march=bpfeb as suggested by Eduard. Basically:

; CHECK-LE: w0 = load_acquire((u8 *)(r1 + 0)) # encoding: [0xd3,0x10,0x00,0x00,0x10,0x00,0x00,0x00]
; CHECK-BE: w0 = load_acquire((u8 *)(r1 + 0)) # encoding: [0xd3,0x01,0x00,0x00,0x00,0x00,0x00,0x10]

src_reg and dst_reg are swapped, and imm is in big endian. Currently focusing on making kernel-side changes (verifier and JITs).
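The byte-level difference can be checked mechanically (a sketch built from the two encodings quoted above):

```python
import struct

# The two encodings quoted above: opcode, regs, off (16-bit), imm (32-bit).
le = bytes([0xD3, 0x10, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00])  # -march=bpfel
be = bytes([0xD3, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10])  # -march=bpfeb

assert le[0] == be[0] == 0xD3            # same opcode byte in both
assert le[1] == 0x10 and be[1] == 0x01   # dst/src register nibbles swapped

# imm decodes to BPF_LOAD_ACQ (0x10) in each encoding's native byte order:
assert struct.unpack("<i", le[4:8])[0] == 0x10
assert struct.unpack(">i", be[4:8])[0] == 0x10
```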


That weird `fatal: reference is not a tree` CI error looks similar to the one mentioned in PR 73506. Running check-all locally for now...

…cpp (NFC)

We are planning to add load (specifically, atomic acquiring load, or
"load-acquire") instructions under the STX instruction class.  To make
that easier, rename the isST*() and isLD*() helper functions based on
what the instructions actually do, rather than their instruction class.
As discussed in [1], introduce BPF instructions with load-acquire and
store-release semantics under -mcpu=v4.

A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x1, acquiring atomic load).  Similarly, a
"store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_STORE_REL (0xb, releasing atomic store).

Unlike existing atomic operations that only support BPF_W (32-bit) and
BPF_DW (64-bit) size modifiers, load-acquires and store-releases also
support BPF_B (8-bit) and BPF_H (16-bit).  An 8- or 16-bit load-acquire
zero-extends the value before writing it to a 32-bit register, just like
ARM64 instruction LDAPRH and friends.

As an example, for -march=bpfel:

  long foo(long *ptr) {
      return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
  }

foo() can be compiled to:

  db 10 00 00 10 00 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))
  95 00 00 00 00 00 00 00  exit

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000010): BPF_LOAD_ACQ

Similarly:

  void bar(short *ptr, short val) {
      __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
  }

bar() can be compiled to:

  cb 21 00 00 b0 00 00 00  store_release((u16 *)(r1 + 0x0), w2)
  95 00 00 00 00 00 00 00  exit

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x000000b0): BPF_STORE_REL

Inline assembly is also supported.  For example:

  asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))" :
               "=r"(ret) : "r"(ptr) : "memory");

Add two macros, __BPF_FEATURE_LOAD_ACQUIRE and
__BPF_FEATURE_STORE_RELEASE, to let developers detect these new features
in source code.  They can also be disabled using two new llc options,
-disable-load-acquire and -disable-store-release, respectively.

Also use ACQUIRE or RELEASE if user requested weaker memory orders
(RELAXED or CONSUME) until we actually support them.  Requesting a
stronger memory order (i.e. SEQ_CST) will cause an error.

[1] https://lore.kernel.org/all/[email protected]/
Sequentially consistent (seq_cst) atomic load and store are not
supported yet for BPF.  Right now, calling __atomic_{load,store}{,_n}()
with __ATOMIC_SEQ_CST will cause an error:

  $ cat bar.c
  int foo(int *ptr) { return __atomic_load_n(ptr, __ATOMIC_SEQ_CST); }
  $ clang --target=bpf -mcpu=v4 -c bar.c > /dev/null
  fatal error: error in backend: Cannot select: t8: i32,ch = AtomicLoad<(load seq_cst (s32) from %ir.0)> t7:1, t7
  ...

Which isn't very useful.  Just like commit 379d908 ("BPF: provide
better error message for unsupported atomic operations"), make it
generate an error message saying that the requested operation isn't
supported, before triggering that "fatal error":

  $ clang --target=bpf -mcpu=v4 -c bar.c > /dev/null
  bar.c:1:5: error: sequentially consistent (seq_cst) atomic load is not supported
    1 | int foo(int *ptr) { return __atomic_load_n(ptr, __ATOMIC_SEQ_CST); }
      |     ^
  ...
@peilin-ye peilin-ye force-pushed the BPF-load_acquire-store_release branch from 7d6abf8 to 827f517 Compare October 17, 2024 00:34