
[AArch64] Stack probing for function prologues #66524

Merged
5 commits merged on Nov 30, 2023

Conversation

momchil-velikov (Collaborator)

This adds code to AArch64 function prologues to protect against stack clash attacks by probing (writing to) the stack at regular enough intervals to ensure that the guard page cannot be skipped over.

The patch depends on and maintains the following invariants:

  • Upon function entry the caller guarantees that it has probed the stack (e.g. performed a store) at some address [sp, #N], where 0 <= N <= 1024. This invariant comes from a requirement for compatibility with GCC.
  • Any address range in the allocated stack, no smaller than stack-probe-size bytes, contains at least one probe.
  • At any time the stack pointer is above or in the guard page.
  • Probes are performed in decreasing address order.

The stack-probe-size is a function attribute that can be set by a platform to correspond to the guard page size.

By default, the stack probe size is 4KiB, which is a safe default as this is the smallest possible page size for AArch64. Linux uses a 64KiB guard for AArch64, so this can be overridden by the stack-probe-size function attribute.

For small frames without a frame pointer (<= 240 bytes), no probes are needed.

For larger frame sizes, LLVM always stores x29 to the stack. This serves as an implicit stack probe. Thus, while allocating stack objects the compiler assumes that the stack has been probed at [sp].
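
A minimal sketch of that implicit probe (the register use follows the description above; the exact offsets are illustrative, not taken from the patch):

stp x29, x30, [sp, #-16]!    // save FP/LR; the store doubles as a probe
mov x29, sp                  //   of the new top of the stack
sub sp, sp, #480             // a later small allocation can then rely on
                             //   the probe already performed at [sp]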

There are multiple probing sequences that can be emitted, depending on the size of the stack allocation:

  • A straight-line sequence of subtracts and stores, used when the allocation size is smaller than 5 guard pages.
  • A loop allocating and probing one page size per iteration, plus at most a single probe to deal with the remainder, used when the allocation size is larger but still known at compile time.
  • A loop which moves the SP down to the target value held in a register (or a loop moving a scratch register down to the target value held in SP), used when the allocation size is not known at compile time, such as when allocating space for SVE values, or when over-aligning the stack. This is emitted in AArch64InstrInfo because it will also be used for dynamic allocas in a future patch.
  • A single probe where the amount of stack adjustment is unknown, but is known to be less than or equal to a page size.
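
To make the first two sequences concrete, here is a sketch assuming a 4 KiB probe size; the registers, immediates, and label are illustrative and not copied from the patch:

// Straight-line variant, e.g. for an 8 KiB fixed allocation:
sub sp, sp, #4096
str xzr, [sp]
sub sp, sp, #4096
str xzr, [sp]

// Loop variant for a 64 KiB fixed allocation (an exact multiple of the
// probe size), shaped like the CMP/B.NE loop emitted by
// inlineStackProbeLoopExactMultiple in the diff below:
sub x9, sp, #16, lsl #12     // x9 = sp - 65536, the target value of SP
.Lloop:
sub sp, sp, #4096            // allocate one probe-size block
str xzr, [sp]                // probe the new top of the stack
cmp sp, x9                   // reached the target yet?
b.ne .Lloop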

@llvmbot commented Sep 15, 2023

@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-aarch64


Patch is 100.06 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66524.diff

12 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.cpp (+361-31)
  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.h (+23)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+34)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+14)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.cpp (+92-1)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.h (+5)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+19-2)
  • (modified) llvm/test/CodeGen/AArch64/framelayout-sve.mir (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/spill-stack-realignment.mir (+1-1)
  • (added) llvm/test/CodeGen/AArch64/stack-probing-64k.ll (+392)
  • (added) llvm/test/CodeGen/AArch64/stack-probing-sve.ll (+661)
  • (added) llvm/test/CodeGen/AArch64/stack-probing.ll (+474)
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index 68e68449d4073b2..2bf61bdddf3d511 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -301,6 +301,7 @@ static bool produceCompactUnwindFrame(MachineFunction &MF);
 static bool needsWinCFI(const MachineFunction &MF);
 static StackOffset getSVEStackSize(const MachineFunction &MF);
 static bool needsShadowCallStackPrologueEpilogue(MachineFunction &MF);
+static unsigned findScratchNonCalleeSaveRegister(MachineBasicBlock *MBB);
 
 /// Returns true if a homogeneous prolog or epilog code can be emitted
 /// for the size optimization. If possible, a frame helper call is injected.
@@ -672,6 +673,74 @@ void AArch64FrameLowering::emitCalleeSavedSVERestores(
   emitCalleeSavedRestores(MBB, MBBI, true);
 }
 
+void AArch64FrameLowering::allocateSVEStackSpace(
+    MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI,
+    StackOffset AllocSize, StackOffset InitialOffset, bool EmitCFI) const {
+  DebugLoc DL;
+  MachineFunction &MF = *MBB.getParent();
+  const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
+  const AArch64RegisterInfo &RegInfo = *Subtarget.getRegisterInfo();
+  const AArch64TargetLowering &TLI = *Subtarget.getTargetLowering();
+  const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
+
+  // If not probing the stack or the (unknown) allocation size is less than
+  // the probe size, decrement the stack pointer right away. This avoids
+  // having to emit a probing loop when allocating space for up to 16 SVE
+  // registers when using 4k probes.
+
+  // The bit-length of SVE registers is architecturally limited.
+  const int64_t MAX_BYTES_PER_SCALABLE_BYTE = 16;
+  int64_t ProbeSize = TLI.getStackProbeSize(MF);
+  if (!TLI.hasInlineStackProbe(MF) ||
+      AllocSize.getScalable() * MAX_BYTES_PER_SCALABLE_BYTE +
+              AllocSize.getFixed() <=
+          ProbeSize) {
+    emitFrameOffset(MBB, MBBI, DL, AArch64::SP, AArch64::SP, -AllocSize, &TII,
+                    MachineInstr::FrameSetup, false, false, nullptr, EmitCFI,
+                    InitialOffset);
+    if (TLI.hasInlineStackProbe(MF)) {
+      // Issue a probe at the top of the stack to prepare for subsequent
+      // allocations.
+      // STR XZR, [TargetReg]
+      BuildMI(MBB, MBBI, DL, TII.get(AArch64::STRXui))
+          .addReg(AArch64::XZR)
+          .addReg(AArch64::SP)
+          .addImm(0)
+          .setMIFlags(MachineInstr::FrameSetup);
+    }
+    return;
+  }
+
+  // If we can't be sure the allocation size is less than the probe size, we
+  // have to emit a stack probing loop.
+  Register ScratchReg = findScratchNonCalleeSaveRegister(&MBB);
+  assert(ScratchReg != AArch64::NoRegister);
+  // Get the new top of the stack into a scratch register.
+  emitFrameOffset(MBB, MBBI, DL, ScratchReg, AArch64::SP, -AllocSize, &TII,
+                  MachineInstr::FrameSetup, false, false, nullptr, EmitCFI,
+                  InitialOffset);
+  // Arrange to emit a probing loop by decrementing SP until it reaches that
+  // new top of the stack.
+  BuildMI(MBB, MBBI, DL, TII.get(AArch64::PROBED_STACKALLOC_VAR), AArch64::SP)
+      .addReg(ScratchReg);
+  // Set SP to its new value.
+  // MOV SP, Xs
+  BuildMI(MBB, MBBI, DL, TII.get(AArch64::ADDXri), AArch64::SP)
+      .addReg(ScratchReg)
+      .addImm(0)
+      .addImm(AArch64_AM::getShifterImm(AArch64_AM::LSL, 0))
+      .setMIFlags(MachineInstr::FrameSetup);
+  if (EmitCFI) {
+    // Set the CFA register back to SP.
+    unsigned Reg = RegInfo.getDwarfRegNum(AArch64::SP, true);
+    unsigned CFIIndex =
+        MF.addFrameInst(MCCFIInstruction::createDefCfaRegister(nullptr, Reg));
+    BuildMI(MBB, MBBI, DL, TII.get(TargetOpcode::CFI_INSTRUCTION))
+        .addCFIIndex(CFIIndex)
+        .setMIFlags(MachineInstr::FrameSetup);
+  }
+}
+
 static MCRegister getRegisterOrZero(MCRegister Reg, bool HasSVE) {
   switch (Reg.id()) {
   default:
@@ -855,9 +924,11 @@ bool AArch64FrameLowering::canUseAsPrologue(
   MachineBasicBlock *TmpMBB = const_cast<MachineBasicBlock *>(&MBB);
   const AArch64Subtarget &Subtarget = MF->getSubtarget<AArch64Subtarget>();
   const AArch64RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
+  const AArch64TargetLowering *TLI = Subtarget.getTargetLowering();
 
-  // Don't need a scratch register if we're not going to re-align the stack.
-  if (!RegInfo->hasStackRealignment(*MF))
+  // Don't need a scratch register if we're not going to re-align the stack or
+  // emit stack probes.
+  if (!RegInfo->hasStackRealignment(*MF) && !TLI->hasInlineStackProbe(*MF))
     return true;
   // Otherwise, we can use any block as long as it has a scratch register
   // available.
@@ -1429,6 +1500,7 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
   const Function &F = MF.getFunction();
   const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
   const AArch64RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
+  const AArch64TargetLowering &TLI = *Subtarget.getTargetLowering();
   const TargetInstrInfo *TII = Subtarget.getInstrInfo();
   MachineModuleInfo &MMI = MF.getMMI();
   AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
@@ -1784,12 +1856,14 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
     }
   }
 
-  StackOffset AllocateBefore = SVEStackSize, AllocateAfter = {};
+  StackOffset SVECalleeSavedSize = {}, SVELocalsSize = SVEStackSize;
   MachineBasicBlock::iterator CalleeSavesBegin = MBBI, CalleeSavesEnd = MBBI;
 
   // Process the SVE callee-saves to determine what space needs to be
   // allocated.
   if (int64_t CalleeSavedSize = AFI->getSVECalleeSavedStackSize()) {
+    LLVM_DEBUG(dbgs() << "SVECalleeSavedStackSize = " << CalleeSavedSize
+                      << "\n");
     // Find callee save instructions in frame.
     CalleeSavesBegin = MBBI;
     assert(IsSVECalleeSave(CalleeSavesBegin) && "Unexpected instruction");
@@ -1797,33 +1871,40 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       ++MBBI;
     CalleeSavesEnd = MBBI;
 
-    AllocateBefore = StackOffset::getScalable(CalleeSavedSize);
-    AllocateAfter = SVEStackSize - AllocateBefore;
+    SVECalleeSavedSize = StackOffset::getScalable(CalleeSavedSize);
+    SVELocalsSize = SVEStackSize - SVECalleeSavedSize;
+
+    // Allocate space for the SVE callee saves.
+    if (SVECalleeSavedSize) {
+      allocateSVEStackSpace(
+          MBB, CalleeSavesBegin, SVECalleeSavedSize,
+          StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes),
+          EmitAsyncCFI && !HasFP);
+      if (EmitAsyncCFI)
+        emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
+    }
   }
 
-  // Allocate space for the callee saves (if any).
-  emitFrameOffset(
-      MBB, CalleeSavesBegin, DL, AArch64::SP, AArch64::SP, -AllocateBefore, TII,
-      MachineInstr::FrameSetup, false, false, nullptr,
-      EmitAsyncCFI && !HasFP && AllocateBefore,
-      StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes));
-
-  if (EmitAsyncCFI)
-    emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
-
-  // Finally allocate remaining SVE stack space.
-  emitFrameOffset(MBB, CalleeSavesEnd, DL, AArch64::SP, AArch64::SP,
-                  -AllocateAfter, TII, MachineInstr::FrameSetup, false, false,
-                  nullptr, EmitAsyncCFI && !HasFP && AllocateAfter,
-                  AllocateBefore + StackOffset::getFixed(
-                                       (int64_t)MFI.getStackSize() - NumBytes));
+  // Allocate stack space for the local SVE objects.
+  if (SVELocalsSize)
+    allocateSVEStackSpace(
+        MBB, CalleeSavesEnd, SVELocalsSize,
+        SVECalleeSavedSize +
+            StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes),
+        EmitAsyncCFI && !HasFP);
 
   // Allocate space for the rest of the frame.
   if (NumBytes) {
     unsigned scratchSPReg = AArch64::SP;
+    bool NeedsStackProbe = TLI.hasInlineStackProbe(MF) &&
+                           (NumBytes > AArch64::StackProbeMaxUnprobedStack ||
+                            MFI.hasVarSizedObjects());
 
     if (NeedsRealignment) {
       scratchSPReg = findScratchNonCalleeSaveRegister(&MBB);
+      NeedsStackProbe |= TLI.hasInlineStackProbe(MF) &&
+                         (NumBytes + MFI.getMaxAlign().value()) >
+                             AArch64::StackProbeMaxUnprobedStack;
       assert(scratchSPReg != AArch64::NoRegister);
     }
 
@@ -1832,12 +1913,36 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
       // the correct value here, as NumBytes also includes padding bytes,
       // which shouldn't be counted here.
-      emitFrameOffset(
-          MBB, MBBI, DL, scratchSPReg, AArch64::SP,
-          StackOffset::getFixed(-NumBytes), TII, MachineInstr::FrameSetup,
-          false, NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
+      StackOffset CFAOffset =
           SVEStackSize +
-              StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes));
+          StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes);
+      if (NeedsStackProbe && !NeedsRealignment) {
+        // If we don't need to re-align the stack, we can use a more efficient
+        // sequence for stack probing.
+        Register ScratchReg = findScratchNonCalleeSaveRegister(&MBB);
+        assert(ScratchReg != AArch64::NoRegister);
+        BuildMI(MBB, MBBI, DL, TII->get(AArch64::PROBED_STACKALLOC))
+            .addDef(ScratchReg)
+            .addImm(NumBytes)
+            .addImm(CFAOffset.getFixed())
+            .addImm(CFAOffset.getScalable());
+        // The fixed allocation may leave unprobed bytes at the top of the
+        // stack. If we have variable-sized objects, we need to issue an extra
+        // probe, so their allocation starts in a known state.
+        if (MFI.hasVarSizedObjects()) {
+          // STR XZR, [SP]
+          BuildMI(MBB, MBBI, DL, TII->get(AArch64::STRXui))
+              .addReg(AArch64::XZR)
+              .addReg(AArch64::SP)
+              .addImm(0)
+              .setMIFlags(MachineInstr::FrameSetup);
+        }
+      } else {
+        emitFrameOffset(MBB, MBBI, DL, scratchSPReg, AArch64::SP,
+                        StackOffset::getFixed(-NumBytes), TII,
+                        MachineInstr::FrameSetup, false, NeedsWinCFI,
+                        &HasWinCFI, EmitAsyncCFI && !HasFP, CFAOffset);
+      }
     }
     if (NeedsRealignment) {
       assert(MFI.getMaxAlign() > Align(1));
@@ -1846,12 +1951,48 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       // SUB X9, SP, NumBytes
       //   -- X9 is temporary register, so shouldn't contain any live data here,
       //   -- free to use. This is already produced by emitFrameOffset above.
-      // AND SP, X9, 0b11111...0000
-      uint64_t AndMask = ~(MFI.getMaxAlign().value() - 1);
 
-      BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), AArch64::SP)
-          .addReg(scratchSPReg, RegState::Kill)
-          .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64));
+      const uint64_t MaxAlign = MFI.getMaxAlign().value();
+      const uint64_t AndMask = ~(MaxAlign - 1);
+
+      if (NeedsStackProbe) {
+        // If allocation size is known to not exceed the probe size, don't emit
+        // a probing loop.
+        if (NumBytes + MaxAlign - 1 <= TLI.getStackProbeSize(MF)) {
+          // AND SP, X9, 0b11111...0000
+          BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), AArch64::SP)
+              .addReg(scratchSPReg, RegState::Kill)
+              .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64))
+              .setMIFlags(MachineInstr::FrameSetup);
+          // STR XZR, [SP]
+          BuildMI(MBB, MBBI, DL, TII->get(AArch64::STRXui))
+              .addReg(AArch64::XZR)
+              .addReg(AArch64::SP)
+              .addImm(0)
+              .setMIFlags(MachineInstr::FrameSetup);
+        } else {
+          // AND X9, X9, 0b11111...0000
+          BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), scratchSPReg)
+              .addReg(scratchSPReg, RegState::Kill)
+              .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64))
+              .setMIFlags(MachineInstr::FrameSetup);
+          BuildMI(MBB, MBBI, DL, TII->get(AArch64::PROBED_STACKALLOC_VAR),
+                  AArch64::SP)
+              .addReg(scratchSPReg);
+          // MOV SP, X9
+          BuildMI(MBB, MBBI, DL, TII->get(AArch64::ADDXri), AArch64::SP)
+              .addReg(scratchSPReg)
+              .addImm(0)
+              .addImm(AArch64_AM::getShifterImm(AArch64_AM::LSL, 0))
+              .setMIFlags(MachineInstr::FrameSetup);
+        }
+      } else {
+        // AND SP, X9, 0b11111...0000
+        BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), AArch64::SP)
+            .addReg(scratchSPReg, RegState::Kill)
+            .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64))
+            .setMIFlags(MachineInstr::FrameSetup);
+      }
       AFI->setStackRealigned(true);
 
       // No need for SEH instructions here; if we're realigning the stack,
@@ -4057,3 +4198,192 @@ void AArch64FrameLowering::orderFrameObjects(
     dbgs() << "\n";
   });
 }
+
+/// Emit a loop to decrement SP until it is equal to TargetReg, with probes at
+/// least every ProbeSize bytes. Returns an iterator of the first instruction
+/// after the loop. The difference between SP and TargetReg must be an exact
+/// multiple of ProbeSize.
+MachineBasicBlock::iterator
+AArch64FrameLowering::inlineStackProbeLoopExactMultiple(
+    MachineBasicBlock::iterator MBBI, int64_t ProbeSize,
+    Register TargetReg) const {
+  MachineBasicBlock &MBB = *MBBI->getParent();
+  MachineFunction &MF = *MBB.getParent();
+  const AArch64InstrInfo *TII =
+      MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+  DebugLoc DL = MBB.findDebugLoc(MBBI);
+
+  MachineFunction::iterator MBBInsertPoint = std::next(MBB.getIterator());
+  MachineBasicBlock *LoopMBB = MF.CreateMachineBasicBlock(MBB.getBasicBlock());
+  MF.insert(MBBInsertPoint, LoopMBB);
+  MachineBasicBlock *ExitMBB = MF.CreateMachineBasicBlock(MBB.getBasicBlock());
+  MF.insert(MBBInsertPoint, ExitMBB);
+
+  // SUB SP, SP, #ProbeSize (or equivalent if ProbeSize is not encodable
+  // in SUB).
+  emitFrameOffset(*LoopMBB, LoopMBB->end(), DL, AArch64::SP, AArch64::SP,
+                  StackOffset::getFixed(-ProbeSize), TII,
+                  MachineInstr::FrameSetup);
+  // STR XZR, [SP]
+  BuildMI(*LoopMBB, LoopMBB->end(), DL, TII->get(AArch64::STRXui))
+      .addReg(AArch64::XZR)
+      .addReg(AArch64::SP)
+      .addImm(0)
+      .setMIFlags(MachineInstr::FrameSetup);
+  // CMP SP, TargetReg
+  BuildMI(*LoopMBB, LoopMBB->end(), DL, TII->get(AArch64::SUBSXrx64),
+          AArch64::XZR)
+      .addReg(AArch64::SP)
+      .addReg(TargetReg)
+      .addImm(AArch64_AM::getArithExtendImm(AArch64_AM::UXTX, 0))
+      .setMIFlags(MachineInstr::FrameSetup);
+  // B.CC Loop
+  BuildMI(*LoopMBB, LoopMBB->end(), DL, TII->get(AArch64::Bcc))
+      .addImm(AArch64CC::NE)
+      .addMBB(LoopMBB)
+      .setMIFlags(MachineInstr::FrameSetup);
+
+  LoopMBB->addSuccessor(ExitMBB);
+  LoopMBB->addSuccessor(LoopMBB);
+  // Synthesize the exit MBB.
+  ExitMBB->splice(ExitMBB->end(), &MBB, MBBI, MBB.end());
+  ExitMBB->transferSuccessorsAndUpdatePHIs(&MBB);
+  MBB.addSuccessor(LoopMBB);
+  // Update liveins.
+  recomputeLiveIns(*LoopMBB);
+  recomputeLiveIns(*ExitMBB);
+
+  return ExitMBB->begin();
+}
+
+MachineBasicBlock::iterator AArch64FrameLowering::inlineStackProbeFixed(
+    MachineBasicBlock::iterator MBBI, Register ScratchReg, int64_t FrameSize,
+    StackOffset CFAOffset) const {
+  MachineBasicBlock *MBB = MBBI->getParent();
+  MachineFunction &MF = *MBB->getParent();
+  const AArch64TargetLowering *TLI =
+      MF.getSubtarget<AArch64Subtarget>().getTargetLowering();
+  const AArch64InstrInfo *TII =
+      MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+  AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+  bool EmitAsyncCFI = AFI->needsAsyncDwarfUnwindInfo(MF);
+  bool HasFP = hasFP(MF);
+
+  DebugLoc DL;
+  int64_t ProbeSize = TLI->getStackProbeSize(MF);
+  int64_t NumBlocks = FrameSize / ProbeSize;
+  int64_t ResidualSize = FrameSize % ProbeSize;
+
+  LLVM_DEBUG(dbgs() << "Stack probing: total " << FrameSize << " bytes, "
+                    << NumBlocks << " blocks of " << ProbeSize
+                    << " bytes, plus " << ResidualSize << " bytes\n");
+
+  // Decrement SP by NumBlocks * ProbeSize bytes, with either an unrolled
+  // sequence or an ordinary loop.
+  if (NumBlocks <= AArch64::StackProbeMaxLoopUnroll) {
+    for (int i = 0; i < NumBlocks; ++i) {
+      // SUB SP, SP, #ProbeSize (or equivalent if ProbeSize is not
+      // encodable in a SUB).
+      emitFrameOffset(*MBB, MBBI, DL, AArch64::SP, AArch64::SP,
+                      StackOffset::getFixed(-ProbeSize), TII,
+                      MachineInstr::FrameSetup, false, false, nullptr,
+                      EmitAsyncCFI && !HasFP, CFAOffset);
+      CFAOffset += StackOffset::getFixed(ProbeSize);
+      // STR XZR, [SP]
+      BuildMI(*MBB, MBBI, DL, TII->get(AArch64::STRXui))
+          .addReg(AArch64::XZR)
+          .addReg(AArch64::SP)
+          .addImm(0)
+          .setMIFlags(MachineInstr::FrameSetup);
+    }
+  } else if (NumBlocks != 0) {
+    // SUB ScratchReg, SP, #NumBlocks * ProbeSize (or equivalent if the
+    // offset is not encodable in SUB). ScratchReg may temporarily become
+    // the CFA register.
+    emitFrameOffset(*MBB, MBBI, DL, ScratchReg, AArch64::SP,
+                    StackOffset::getFixed(-ProbeSize * NumBlocks), TII,
+                    MachineInstr::FrameSetup, false, false, nullptr,
+                    EmitAsyncCFI && !HasFP, CFAOffset);
+    CFAOffset += StackOffset::getFixed(ProbeSize * NumBlocks);
+    MBBI = inlineStackProbeLoopExactMultiple(MBBI, ProbeSize, ScratchReg);
+    MBB = MBBI->getParent();
+    if (EmitAsyncCFI && !HasFP) {
+      // Set the CFA register back to SP.
+      const AArch64RegisterInfo &RegInfo =
+          *MF.getSubtarget<AArch64Subtarget>().getRegisterInfo();
+      unsigned Reg = RegInfo.getDwarfRegNum(AArch64::SP, true);
+      unsigned CFIIndex =
+          MF.addFrameInst(MCCFIInstruction::createDefCfaRegister(nullptr, Reg));
+      BuildMI(*MBB, MBBI, DL, TII->get(TargetOpcode::CFI_INSTRUCTION))
+          .addCFIIndex(CFIIndex)
+          .setMIFlags(MachineInstr::FrameSetup);
+    }
+  }
+
+  if (ResidualSize != 0) {
+    // SUB SP, SP, #ResidualSize (or equivalent if ResidualSize is not encodable
+    // in SUB).
+    emitFrameOffset(*MBB, MBBI, DL, AArch64::SP, AArch64::SP,
+                    StackOffset::getFixed(-ResidualSize), TII,
+                    MachineInstr::FrameSetup, false, false, nullptr,
+                    EmitAsyncCFI && !HasFP, CFAOffset);
+    if (ResidualSize > AArch64::StackProbeMaxUnprobedStack) {
+      // STR XZR, [SP]
+      BuildMI(*MBB, MBBI, DL, TII->get(AArch64::STRXui))
+          .addReg(AArch64::XZR)
+          .addReg(AArch64::SP)
+          .addImm(0)
+          .setMIFlags(MachineInstr::FrameSetup);
+    }
+  }
+
+  MachineBasicBlock::iterator Next = std::next(MBBI);
+  return Next;
+}
+
+MachineBasicBlock::iterator AArch64FrameLowering::inlineStackProbeFixed(
+    MachineBasicBlock::iterator MBBI) const {
+
+  Register ScratchReg = MBBI->getOperand(0).getReg();
+  int64_t FrameSize = MBBI->getOperand(1).getImm();
+  StackOffset CFAOffset = StackOffset::get(MBBI->getOperand(2).getImm(),
+                                           MBBI->getOperand(3).getImm());
+
+  MachineBasicBlock::iterator NextInst =
+      inlineStackProbeFixed(MBBI, ScratchReg, FrameSize, CFAOffset);
+
+  MBBI->eraseFromParent();
+  return NextInst;
+}
+
+MachineBasicBlock::iterator AArch64FrameLowering::inlineStackProbeVar(
+    MachineBasicBlock::iterator MBBI) const {
+  MachineBasicBlock &MBB = *MBBI->getParent();
+  MachineFunction &MF = *MBB.getParent();
+  const AArch64InstrInfo *TII =
+      MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+
+  DebugLoc DL = MBB.findDebugLoc(MBBI);
+  Register ScratchReg = MBBI->getOperand(0).getReg();
+  Register TargetReg = MBBI->getOperand(1).getReg();
+
+  MachineBasicBlock::iterator NextInst =
+      TII->insertStac...
[truncated]

@efriedma-quic (Collaborator)

I'm still waiting for some reply to https://reviews.llvm.org/D158084#4593014 before I continue reviewing.

@momchil-velikov (Collaborator, Author)

I'm still waiting for some reply to https://reviews.llvm.org/D158084#4593014 before I continue reviewing.

I've now refactored to share between Windows and non-Windows the part that determines the probe size and whether the stack probing is enabled.

@llvmbot added the clang, clang:driver, and clang:codegen labels on Oct 13, 2023
github-actions bot commented Oct 13, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@momchil-velikov (Collaborator, Author)

Ping?

@oskarwirga (Contributor)

Upon function entry the caller guarantees that it has probed the stack (e.g. performed a store) at some address [sp, #N], where 0 <= N <= 1024.

I haven't been able to produce a minimal, sharable example as of yet, but I'm encountering a runtime error associated with an inlined function where stack probing is active. The error manifests as a null pointer dereference, originating from a stack value that is probed (and set to 0) before being subsequently dereferenced.

The IR contributing to this runtime issue is somewhat complex and challenging to interpret, but here are my observations:

  • A value returned from malloc(some_struct) is stored in a stack variable.
  • This stack variable is passed as an argument to a function.
  • This function is later inlined, and within the inlined body, it attempts to set a value in the struct.
  • At runtime, when setting the value we get a null pointer dereference.

I'm working to isolate this issue and will share a repro ASAP. In the meantime, any insights or suggestions based on this description would be greatly appreciated.

Also is it required to write to the value? Would reading the value be sufficient?

@momchil-velikov (Collaborator, Author) commented Oct 25, 2023

I haven't been able to produce a minimal, sharable example as of yet, but I'm encountering a runtime error associated with an inlined function where stack probing is active. The error manifests as a null pointer dereference, originating from a stack value that is probed (and set to 0) before being subsequently dereferenced.

All the stack probing should have already finished before the call to malloc.

I'm working to isolate this issue and will share a repro ASAP. In the meantime, any insights or suggestions based on this description would be greatly appreciated.

Just to make things simpler, can you try disabling the shrink-wrapping and see what happens?

Also is it required to write to the value? Would reading the value be sufficient?

I can't really see a compelling reason to prefer one over the other. Maybe for the odd chance some kernel/runtime allocates read-only (as opposed to no access at all) guard regions.

Anyway, changing that won't solve the problem here, at most it could hide it.

@momchil-velikov (Collaborator, Author)

Apologies for still not being able to create a reproducible example I can share, but what I am seeing is the stack probe write overwriting the value at the tip of the stack when I step through execution in a debugger:

Can you spot a place where the probe instruction is not immediately after a decrement of the stack (disregarding some random register-to-register arithmetic that may appear)?

If you can't (and you should not find such a case), perhaps some instruction writes below the sp? That would be against the AArch64 ABI.

Is the red zone enabled (--aarch64-redzone=true|false)? If it is on, can you check whether disabling it makes a difference?

All the stack probing should have already finished before the call to malloc.

Only for the containing function; the functions which have their stack probes inlined will have them in the middle of the function, which then results in this null-deref.

The stack probes (except alloca ones) are emitted by the PrologEpilogInsertion pass, which is very late in the pipeline.
I didn't think inlining could happen after that pass. Can you share your compilation pipeline (--debug-pass=Structure)?

@oskarwirga (Contributor) left a comment

Testing this patch set on a complex application (including later PRs) yielded no issues :)

Thank you for your work on this, I appreciate it!

@momchil-velikov (Collaborator, Author)

I'm going to squash the commits which belong to this PR as I don't believe they are useful in isolation anymore and they get in the way of refactoring/rebasing.

(Long story short: I did a patch to avoid having two back-to-back probing loops, then factored out a stack-probing-independent part (to come in a separate review later), and rebasing this PR on top of it is a major pain.)

@momchil-velikov force-pushed the stack-clash-protection branch 2 times, most recently from c8c0bf5 to bbc0152, on November 11, 2023 at 16:45
@momchil-velikov (Collaborator, Author)

In the last update:

  • do not set the FrameSetup flag for dynamic allocations
  • avoid back-to-back probing loops for the allocation of SVE locals and non-SVE locals: fold the allocations together and emit a single loop (or no loop)

@momchil-velikov (Collaborator, Author)

Latest update adds this patch: ff16f79

@momchil-velikov (Collaborator, Author) commented Nov 22, 2023

I only now noticed I had a bunch of comments/replies sitting for a few weeks in "Pending" state :/

@efriedma-quic (Collaborator) left a comment

LGTM

momchil-velikov and others added 5 commits November 30, 2023 13:59
This adds code to AArch64 function prologues to protect against stack
clash attacks by probing (writing to) the stack at regular enough
intervals to ensure that the guard page cannot be skipped over.

The patch depends on and maintains the following invariants:

* Upon function entry the caller guarantees that it has probed the stack
  (e.g. performed a store) at some address [sp, #N],
  where `0 <= N <= 1024`. This invariant comes from a requirement for
  compatibility with GCC.

* Any address range in the allocated stack, no smaller than
  stack-probe-size bytes contains at least one probe

* At any time the stack pointer is above or in the guard page

* Probes are performed in decreasing address order

The `stack-probe-size` is a function attribute that can be set by a
platform to correspond to the guard page size. By default, the stack
probe size is 4KiB, which is a safe default as this is the smallest
possible page size for AArch64. Linux uses a 64KiB guard for AArch64, so
this can be overridden by the `stack-probe-size` function attribute.

For small frames without a frame pointer (<= 240 bytes), no probes are
needed.

For larger frame sizes, LLVM always stores `x29` to the stack. This serves
as an implicit stack probe. Thus, while allocating stack objects the
compiler assumes that the stack has been probed at `[sp]`.

There are multiple probing sequences that can be emitted, depending on
the size of the stack allocation:

* A straight-line sequence of subtracts and stores, used when the
  allocation size is smaller than 5 guard pages.

* A loop allocating and probing one page size per iteration, plus at
  most a single probe to deal with the remainder, used when the
  allocation size is larger but still known at compile time.

* A loop which moves the SP down to the target value held in a register,
  used when the allocation size is not known at compile-time, such as
  when allocating space for SVE values, or when over-aligning the stack.
  This is emitted in AArch64InstrInfo because it will also be used for
  dynamic allocas in a future patch.

* A single probe where the amount of stack adjustment is unknown, but is
  known to be less than or equal to the page size.

Change-Id: Ib31bc23d2806835bc6da822599f2bd2ecc18e4bf
Co-authored-by: Oliver Stannard <[email protected]>
In some cases a probe may be elided, resulting in more than 1024 bytes of
unprobed area at the top of the stack, or in a decrement of the SP by more
than a guard area size.
Comment on lines +9528 to +9533
// STR XZR, [SP]
BuildMI(*ExitMBB, ExitMBB->end(), DL, TII->get(AArch64::STRXui))
    .addReg(AArch64::XZR)
    .addReg(AArch64::SP)
    .addImm(0)
    .setMIFlags(Flags);
Contributor

Hey @momchil-velikov, I'm encountering an issue where the stack probe is overwriting stack memory, similar to the issue I faced before. I apologize for not addressing this earlier, but backporting these changes introduced significant overhead.

The issue seems to be caused by this final stack probe instruction that is overwriting a stack value. I'm considering removing this probe instruction as a potential solution.

I don't anticipate this change will compromise security. Given our current stack probing strategy, we're always within one page of the most recent probe. Therefore, any subsequent instructions accessing memory at [sp] or above will either be valid or trigger a guard page fault.

I can't definitively confirm whether this issue is a result of backporting or an inherent problem until I upgrade to LLVM 18 which is some ways away.

Please let me know if my understanding is incorrect. I appreciate your work on this patchset and thank you for your assistance! :)

Collaborator

Assuming TargetReg is lower than SP on entry at "LoopTest", the final store must be to an address lower than SP on entry to the sequence. (And since SP is always 16-byte aligned, it's impossible to have an issue with partial overlap.) Since that's freshly allocated memory, nothing should care what it contains.

I guess maybe weird things could happen if something tries to allocate 0 bytes of memory? Probably something that needs to be fixed, but it's unlikely you'd run into it from C code.

Also, maybe worth checking your code isn't depending on a value of an uninitialized variable.

Removing the final store completely breaks the guarantees the code is supposed to provide: if the allocation is less than the page size, the only store is the final store. So if the compiled code, for example, calls "alloca(1024)" in a loop, you can skip over an arbitrary number of pages.
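
To illustrate that last point, a sketch (4 KiB pages; a hypothetical worst case, not code from the patch) of what repeated small dynamic allocations would reduce to if the trailing probe were removed:

// alloca(1024) four times, with the final STR XZR, [SP] dropped and
// nothing else touching the newly allocated memory in between:
sub sp, sp, #1024
sub sp, sp, #1024
sub sp, sp, #1024
sub sp, sp, #1024            // SP has now moved a full page with no access,
                             //   so the guard page can be stepped over

With the final str xzr, [sp] kept after each decrement, every allocation of up to a page touches its new top of stack, so SP never gets more than one probe-size below the last probed address.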

@oskarwirga (Contributor) commented Dec 7, 2023

Thanks for that explanation, yes, I see it leaves a large gap. There's like a million optimizations and mitigations slapped on this code, so it gets difficult to understand the root cause.

I guess maybe weird things could happen if something tries to allocate 0 bytes of memory? Probably something that needs to be fixed, but it's unlikely you'd run into it from C code.

Thanks for this, it made me look into it a bit more: we're not allocating 0 bytes, but we are allocating 0x10 bytes, which might cause some issues. I appreciate your quick reply, thank you again!

EDIT: We are allocating 0 bytes, lol, I read it wrong. OK, so that's what the issue is.

Contributor

Addressed this in #74806

Labels: backend:AArch64, clang, clang:codegen, clang:driver