[AArch64] Stack probing for function prologues #66524
Conversation
@llvm/pr-subscribers-clang-driver @llvm/pr-subscribers-backend-aarch64

Changes

This adds code to AArch64 function prologues to protect against stack clash attacks by probing (writing to) the stack at regular enough intervals to ensure that the guard page cannot be skipped over.

The patch depends on and maintains the following invariants:

* Upon function entry the caller guarantees that it has probed the stack (e.g. performed a store) at some address [sp, #N], where 0 <= N <= 1024. This invariant comes from a requirement for compatibility with GCC.
* Any address range in the allocated stack, no smaller than stack-probe-size bytes, contains at least one probe.
* At any time the stack pointer is above or in the guard page.
* Probes are performed in decreasing address order.

The stack-probe-size is a function attribute that can be set by a platform to correspond to the guard page size. By default, the stack probe size is 4KiB, which is a safe default as this is the smallest possible page size for AArch64. Linux uses a 64KiB guard for AArch64, so this can be overridden by the stack-probe-size function attribute.

For small frames without a frame pointer (<= 240 bytes), no probes are needed.

For larger frame sizes, LLVM always stores x29 to the stack. This serves as an implicit stack probe. Thus, while allocating stack objects the compiler assumes that the stack has been probed at [sp].

There are multiple probing sequences that can be emitted, depending on the size of the stack allocation:

* A straight-line sequence of subtracts and stores, used when the allocation size is smaller than 5 guard pages.
* A loop allocating and probing one page size per iteration, plus at most a single probe to deal with the remainder, used when the allocation size is larger but still known at compile time.
* A loop which moves the SP down to the target value held in a register (or a loop moving a scratch register down to the target value held in SP), used when the allocation size is not known at compile time, such as when allocating space for SVE values, or when over-aligning the stack. This is emitted in AArch64InstrInfo because it will also be used for dynamic allocas in a future patch.
* A single probe where the amount of stack adjustment is unknown, but is known to be less than or equal to a page size.

Patch is 100.06 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66524.diff

12 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index 68e68449d4073b2..2bf61bdddf3d511 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -301,6 +301,7 @@ static bool produceCompactUnwindFrame(MachineFunction &MF);
static bool needsWinCFI(const MachineFunction &MF);
static StackOffset getSVEStackSize(const MachineFunction &MF);
static bool needsShadowCallStackPrologueEpilogue(MachineFunction &MF);
+static unsigned findScratchNonCalleeSaveRegister(MachineBasicBlock *MBB);
/// Returns true if a homogeneous prolog or epilog code can be emitted
/// for the size optimization. If possible, a frame helper call is injected.
@@ -672,6 +673,74 @@ void AArch64FrameLowering::emitCalleeSavedSVERestores(
emitCalleeSavedRestores(MBB, MBBI, true);
}
+void AArch64FrameLowering::allocateSVEStackSpace(
+ MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI,
+ StackOffset AllocSize, StackOffset InitialOffset, bool EmitCFI) const {
+ DebugLoc DL;
+ MachineFunction &MF = *MBB.getParent();
+ const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
+ const AArch64RegisterInfo &RegInfo = *Subtarget.getRegisterInfo();
+ const AArch64TargetLowering &TLI = *Subtarget.getTargetLowering();
+ const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
+
+ // If not probing the stack or the (unknown) allocation size is less than the
+ // probe size, decrement the stack pointer right away. This avoids having to
+ // emit a probing loop when allocating space for up to 16 SVE registers when
+ // using 4k probes.
+
+ // The bit-length of SVE registers is architecturally limited.
+ const int64_t MAX_BYTES_PER_SCALABLE_BYTE = 16;
+ int64_t ProbeSize = TLI.getStackProbeSize(MF);
+ if (!TLI.hasInlineStackProbe(MF) ||
+ AllocSize.getScalable() * MAX_BYTES_PER_SCALABLE_BYTE +
+ AllocSize.getFixed() <=
+ ProbeSize) {
+ emitFrameOffset(MBB, MBBI, DL, AArch64::SP, AArch64::SP, -AllocSize, &TII,
+ MachineInstr::FrameSetup, false, false, nullptr, EmitCFI,
+ InitialOffset);
+ if (TLI.hasInlineStackProbe(MF)) {
+ // Issue a probe at the top of the stack to prepare for subsequent
+ // allocations.
+ // STR XZR, [TargetReg]
+ BuildMI(MBB, MBBI, DL, TII.get(AArch64::STRXui))
+ .addReg(AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addImm(0)
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+ return;
+ }
+
+ // If we can't be sure the allocation size is less than the probe size, we
+ // have to emit a stack probing loop.
+ Register ScratchReg = findScratchNonCalleeSaveRegister(&MBB);
+ assert(ScratchReg != AArch64::NoRegister);
+ // Get the new top of the stack into a scratch register.
+ emitFrameOffset(MBB, MBBI, DL, ScratchReg, AArch64::SP, -AllocSize, &TII,
+ MachineInstr::FrameSetup, false, false, nullptr, EmitCFI,
+ InitialOffset);
+ // Arrange to emit a probing loop by decrementing SP until it reaches that
+ // new top of the stack.
+ BuildMI(MBB, MBBI, DL, TII.get(AArch64::PROBED_STACKALLOC_VAR), AArch64::SP)
+ .addReg(ScratchReg);
+ // Set SP to its new value.
+ // MOV SP, Xs
+ BuildMI(MBB, MBBI, DL, TII.get(AArch64::ADDXri), AArch64::SP)
+ .addReg(ScratchReg)
+ .addImm(0)
+ .addImm(AArch64_AM::getShifterImm(AArch64_AM::LSL, 0))
+ .setMIFlags(MachineInstr::FrameSetup);
+ if (EmitCFI) {
+ // Set the CFA register back to SP.
+ unsigned Reg = RegInfo.getDwarfRegNum(AArch64::SP, true);
+ unsigned CFIIndex =
+ MF.addFrameInst(MCCFIInstruction::createDefCfaRegister(nullptr, Reg));
+ BuildMI(MBB, MBBI, DL, TII.get(TargetOpcode::CFI_INSTRUCTION))
+ .addCFIIndex(CFIIndex)
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+}
+
static MCRegister getRegisterOrZero(MCRegister Reg, bool HasSVE) {
switch (Reg.id()) {
default:
@@ -855,9 +924,11 @@ bool AArch64FrameLowering::canUseAsPrologue(
MachineBasicBlock *TmpMBB = const_cast<MachineBasicBlock *>(&MBB);
const AArch64Subtarget &Subtarget = MF->getSubtarget<AArch64Subtarget>();
const AArch64RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
+ const AArch64TargetLowering *TLI = Subtarget.getTargetLowering();
- // Don't need a scratch register if we're not going to re-align the stack.
- if (!RegInfo->hasStackRealignment(*MF))
+ // Don't need a scratch register if we're not going to re-align the stack or
+ // emit stack probes.
+ if (!RegInfo->hasStackRealignment(*MF) && !TLI->hasInlineStackProbe(*MF))
return true;
// Otherwise, we can use any block as long as it has a scratch register
// available.
@@ -1429,6 +1500,7 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
const Function &F = MF.getFunction();
const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
const AArch64RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
+ const AArch64TargetLowering &TLI = *Subtarget.getTargetLowering();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
MachineModuleInfo &MMI = MF.getMMI();
AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
@@ -1784,12 +1856,14 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
}
}
- StackOffset AllocateBefore = SVEStackSize, AllocateAfter = {};
+ StackOffset SVECalleeSavedSize = {}, SVELocalsSize = SVEStackSize;
MachineBasicBlock::iterator CalleeSavesBegin = MBBI, CalleeSavesEnd = MBBI;
// Process the SVE callee-saves to determine what space needs to be
// allocated.
if (int64_t CalleeSavedSize = AFI->getSVECalleeSavedStackSize()) {
+ LLVM_DEBUG(dbgs() << "SVECalleeSavedStackSize = " << CalleeSavedSize
+ << "\n");
// Find callee save instructions in frame.
CalleeSavesBegin = MBBI;
assert(IsSVECalleeSave(CalleeSavesBegin) && "Unexpected instruction");
@@ -1797,33 +1871,40 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
++MBBI;
CalleeSavesEnd = MBBI;
- AllocateBefore = StackOffset::getScalable(CalleeSavedSize);
- AllocateAfter = SVEStackSize - AllocateBefore;
+ SVECalleeSavedSize = StackOffset::getScalable(CalleeSavedSize);
+ SVELocalsSize = SVEStackSize - SVECalleeSavedSize;
+
+ // Allocate space for the SVE callee saves.
+ if (SVECalleeSavedSize) {
+ allocateSVEStackSpace(
+ MBB, CalleeSavesBegin, SVECalleeSavedSize,
+ StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes),
+ EmitAsyncCFI && !HasFP);
+ if (EmitAsyncCFI)
+ emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
+ }
}
- // Allocate space for the callee saves (if any).
- emitFrameOffset(
- MBB, CalleeSavesBegin, DL, AArch64::SP, AArch64::SP, -AllocateBefore, TII,
- MachineInstr::FrameSetup, false, false, nullptr,
- EmitAsyncCFI && !HasFP && AllocateBefore,
- StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes));
-
- if (EmitAsyncCFI)
- emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
-
- // Finally allocate remaining SVE stack space.
- emitFrameOffset(MBB, CalleeSavesEnd, DL, AArch64::SP, AArch64::SP,
- -AllocateAfter, TII, MachineInstr::FrameSetup, false, false,
- nullptr, EmitAsyncCFI && !HasFP && AllocateAfter,
- AllocateBefore + StackOffset::getFixed(
- (int64_t)MFI.getStackSize() - NumBytes));
+ // Allocate stack space for the local SVE objects.
+ if (SVELocalsSize)
+ allocateSVEStackSpace(
+ MBB, CalleeSavesEnd, SVELocalsSize,
+ SVECalleeSavedSize +
+ StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes),
+ EmitAsyncCFI && !HasFP);
// Allocate space for the rest of the frame.
if (NumBytes) {
unsigned scratchSPReg = AArch64::SP;
+ bool NeedsStackProbe = TLI.hasInlineStackProbe(MF) &&
+ (NumBytes > AArch64::StackProbeMaxUnprobedStack ||
+ MFI.hasVarSizedObjects());
if (NeedsRealignment) {
scratchSPReg = findScratchNonCalleeSaveRegister(&MBB);
+ NeedsStackProbe |= TLI.hasInlineStackProbe(MF) &&
+ (NumBytes + MFI.getMaxAlign().value()) >
+ AArch64::StackProbeMaxUnprobedStack;
assert(scratchSPReg != AArch64::NoRegister);
}
@@ -1832,12 +1913,36 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
// FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
// the correct value here, as NumBytes also includes padding bytes,
// which shouldn't be counted here.
- emitFrameOffset(
- MBB, MBBI, DL, scratchSPReg, AArch64::SP,
- StackOffset::getFixed(-NumBytes), TII, MachineInstr::FrameSetup,
- false, NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
+ StackOffset CFAOffset =
SVEStackSize +
- StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes));
+ StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes);
+ if (NeedsStackProbe && !NeedsRealignment) {
+ // If we don't need to re-align the stack, we can use a more efficient
+ // sequence for stack probing.
+ Register ScratchReg = findScratchNonCalleeSaveRegister(&MBB);
+ assert(ScratchReg != AArch64::NoRegister);
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::PROBED_STACKALLOC))
+ .addDef(ScratchReg)
+ .addImm(NumBytes)
+ .addImm(CFAOffset.getFixed())
+ .addImm(CFAOffset.getScalable());
+ // The fixed allocation may leave unprobed bytes at the top of the
+ // stack. If we have variable-sized objects, we need to issue an extra
+ // probe, so their allocation starts in a known state.
+ if (MFI.hasVarSizedObjects()) {
+ // STR XZR, [SP]
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::STRXui))
+ .addReg(AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addImm(0)
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+ } else {
+ emitFrameOffset(MBB, MBBI, DL, scratchSPReg, AArch64::SP,
+ StackOffset::getFixed(-NumBytes), TII,
+ MachineInstr::FrameSetup, false, NeedsWinCFI,
+ &HasWinCFI, EmitAsyncCFI && !HasFP, CFAOffset);
+ }
}
if (NeedsRealignment) {
assert(MFI.getMaxAlign() > Align(1));
@@ -1846,12 +1951,48 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
// SUB X9, SP, NumBytes
// -- X9 is temporary register, so shouldn't contain any live data here,
// -- free to use. This is already produced by emitFrameOffset above.
- // AND SP, X9, 0b11111...0000
- uint64_t AndMask = ~(MFI.getMaxAlign().value() - 1);
- BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), AArch64::SP)
- .addReg(scratchSPReg, RegState::Kill)
- .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64));
+ const uint64_t MaxAlign = MFI.getMaxAlign().value();
+ const uint64_t AndMask = ~(MaxAlign - 1);
+
+ if (NeedsStackProbe) {
+ // If allocation size is known to not exceed the probe size, don't emit
+ // a probing loop.
+ if (NumBytes + MaxAlign - 1 <= TLI.getStackProbeSize(MF)) {
+ // AND SP, X9, 0b11111...0000
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), AArch64::SP)
+ .addReg(scratchSPReg, RegState::Kill)
+ .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64))
+ .setMIFlags(MachineInstr::FrameSetup);
+ // STR XZR, [SP]
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::STRXui))
+ .addReg(AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addImm(0)
+ .setMIFlags(MachineInstr::FrameSetup);
+ } else {
+ // AND X9, X9, 0b11111...0000
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), scratchSPReg)
+ .addReg(scratchSPReg, RegState::Kill)
+ .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64))
+ .setMIFlags(MachineInstr::FrameSetup);
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::PROBED_STACKALLOC_VAR),
+ AArch64::SP)
+ .addReg(scratchSPReg);
+ // MOV SP, X9
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::ADDXri), AArch64::SP)
+ .addReg(scratchSPReg)
+ .addImm(0)
+ .addImm(AArch64_AM::getShifterImm(AArch64_AM::LSL, 0))
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+ } else {
+ // AND SP, X9, 0b11111...0000
+ BuildMI(MBB, MBBI, DL, TII->get(AArch64::ANDXri), AArch64::SP)
+ .addReg(scratchSPReg, RegState::Kill)
+ .addImm(AArch64_AM::encodeLogicalImmediate(AndMask, 64))
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
AFI->setStackRealigned(true);
// No need for SEH instructions here; if we're realigning the stack,
@@ -4057,3 +4198,192 @@ void AArch64FrameLowering::orderFrameObjects(
dbgs() << "\n";
});
}
+
+/// Emit a loop to decrement SP until it is equal to TargetReg, with probes at
+/// least every ProbeSize bytes. Returns an iterator of the first instruction
+/// after the loop. The difference between SP and TargetReg must be an exact
+/// multiple of ProbeSize.
+MachineBasicBlock::iterator
+AArch64FrameLowering::inlineStackProbeLoopExactMultiple(
+ MachineBasicBlock::iterator MBBI, int64_t ProbeSize,
+ Register TargetReg) const {
+ MachineBasicBlock &MBB = *MBBI->getParent();
+ MachineFunction &MF = *MBB.getParent();
+ const AArch64InstrInfo *TII =
+ MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+ DebugLoc DL = MBB.findDebugLoc(MBBI);
+
+ MachineFunction::iterator MBBInsertPoint = std::next(MBB.getIterator());
+ MachineBasicBlock *LoopMBB = MF.CreateMachineBasicBlock(MBB.getBasicBlock());
+ MF.insert(MBBInsertPoint, LoopMBB);
+ MachineBasicBlock *ExitMBB = MF.CreateMachineBasicBlock(MBB.getBasicBlock());
+ MF.insert(MBBInsertPoint, ExitMBB);
+
+ // SUB SP, SP, #ProbeSize (or equivalent if ProbeSize is not encodable
+ // in SUB).
+ emitFrameOffset(*LoopMBB, LoopMBB->end(), DL, AArch64::SP, AArch64::SP,
+ StackOffset::getFixed(-ProbeSize), TII,
+ MachineInstr::FrameSetup);
+ // STR XZR, [SP]
+ BuildMI(*LoopMBB, LoopMBB->end(), DL, TII->get(AArch64::STRXui))
+ .addReg(AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addImm(0)
+ .setMIFlags(MachineInstr::FrameSetup);
+ // CMP SP, TargetReg
+ BuildMI(*LoopMBB, LoopMBB->end(), DL, TII->get(AArch64::SUBSXrx64),
+ AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addReg(TargetReg)
+ .addImm(AArch64_AM::getArithExtendImm(AArch64_AM::UXTX, 0))
+ .setMIFlags(MachineInstr::FrameSetup);
+ // B.CC Loop
+ BuildMI(*LoopMBB, LoopMBB->end(), DL, TII->get(AArch64::Bcc))
+ .addImm(AArch64CC::NE)
+ .addMBB(LoopMBB)
+ .setMIFlags(MachineInstr::FrameSetup);
+
+ LoopMBB->addSuccessor(ExitMBB);
+ LoopMBB->addSuccessor(LoopMBB);
+ // Synthesize the exit MBB.
+ ExitMBB->splice(ExitMBB->end(), &MBB, MBBI, MBB.end());
+ ExitMBB->transferSuccessorsAndUpdatePHIs(&MBB);
+ MBB.addSuccessor(LoopMBB);
+ // Update liveins.
+ recomputeLiveIns(*LoopMBB);
+ recomputeLiveIns(*ExitMBB);
+
+ return ExitMBB->begin();
+}
+
+MachineBasicBlock::iterator AArch64FrameLowering::inlineStackProbeFixed(
+ MachineBasicBlock::iterator MBBI, Register ScratchReg, int64_t FrameSize,
+ StackOffset CFAOffset) const {
+ MachineBasicBlock *MBB = MBBI->getParent();
+ MachineFunction &MF = *MBB->getParent();
+ const AArch64TargetLowering *TLI =
+ MF.getSubtarget<AArch64Subtarget>().getTargetLowering();
+ const AArch64InstrInfo *TII =
+ MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+ AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+ bool EmitAsyncCFI = AFI->needsAsyncDwarfUnwindInfo(MF);
+ bool HasFP = hasFP(MF);
+
+ DebugLoc DL;
+ int64_t ProbeSize = TLI->getStackProbeSize(MF);
+ int64_t NumBlocks = FrameSize / ProbeSize;
+ int64_t ResidualSize = FrameSize % ProbeSize;
+
+ LLVM_DEBUG(dbgs() << "Stack probing: total " << FrameSize << " bytes, "
+ << NumBlocks << " blocks of " << ProbeSize
+ << " bytes, plus " << ResidualSize << " bytes\n");
+
+ // Decrement SP by NumBlocks * ProbeSize bytes, with either an unrolled
+ // sequence or an ordinary loop.
+ if (NumBlocks <= AArch64::StackProbeMaxLoopUnroll) {
+ for (int i = 0; i < NumBlocks; ++i) {
+ // SUB SP, SP, #ProbeSize (or equivalent if ProbeSize is not
+ // encodable in a SUB).
+ emitFrameOffset(*MBB, MBBI, DL, AArch64::SP, AArch64::SP,
+ StackOffset::getFixed(-ProbeSize), TII,
+ MachineInstr::FrameSetup, false, false, nullptr,
+ EmitAsyncCFI && !HasFP, CFAOffset);
+ CFAOffset += StackOffset::getFixed(ProbeSize);
+ // STR XZR, [SP]
+ BuildMI(*MBB, MBBI, DL, TII->get(AArch64::STRXui))
+ .addReg(AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addImm(0)
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+ } else if (NumBlocks != 0) {
+ // SUB ScratchReg, SP, #ProbeSize * NumBlocks (or equivalent if it is not
+ // encodable in ADD). ScratchReg may temporarily become the CFA register.
+ emitFrameOffset(*MBB, MBBI, DL, ScratchReg, AArch64::SP,
+ StackOffset::getFixed(-ProbeSize * NumBlocks), TII,
+ MachineInstr::FrameSetup, false, false, nullptr,
+ EmitAsyncCFI && !HasFP, CFAOffset);
+ CFAOffset += StackOffset::getFixed(ProbeSize * NumBlocks);
+ MBBI = inlineStackProbeLoopExactMultiple(MBBI, ProbeSize, ScratchReg);
+ MBB = MBBI->getParent();
+ if (EmitAsyncCFI && !HasFP) {
+ // Set the CFA register back to SP.
+ const AArch64RegisterInfo &RegInfo =
+ *MF.getSubtarget<AArch64Subtarget>().getRegisterInfo();
+ unsigned Reg = RegInfo.getDwarfRegNum(AArch64::SP, true);
+ unsigned CFIIndex =
+ MF.addFrameInst(MCCFIInstruction::createDefCfaRegister(nullptr, Reg));
+ BuildMI(*MBB, MBBI, DL, TII->get(TargetOpcode::CFI_INSTRUCTION))
+ .addCFIIndex(CFIIndex)
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+ }
+
+ if (ResidualSize != 0) {
+ // SUB SP, SP, #ResidualSize (or equivalent if ResidualSize is not encodable
+ // in SUB).
+ emitFrameOffset(*MBB, MBBI, DL, AArch64::SP, AArch64::SP,
+ StackOffset::getFixed(-ResidualSize), TII,
+ MachineInstr::FrameSetup, false, false, nullptr,
+ EmitAsyncCFI && !HasFP, CFAOffset);
+ if (ResidualSize > AArch64::StackProbeMaxUnprobedStack) {
+ // STR XZR, [SP]
+ BuildMI(*MBB, MBBI, DL, TII->get(AArch64::STRXui))
+ .addReg(AArch64::XZR)
+ .addReg(AArch64::SP)
+ .addImm(0)
+ .setMIFlags(MachineInstr::FrameSetup);
+ }
+ }
+
+ MachineBasicBlock::iterator Next = std::next(MBBI);
+ return Next;
+}
+
+MachineBasicBlock::iterator AArch64FrameLowering::inlineStackProbeFixed(
+ MachineBasicBlock::iterator MBBI) const {
+
+ Register ScratchReg = MBBI->getOperand(0).getReg();
+ int64_t FrameSize = MBBI->getOperand(1).getImm();
+ StackOffset CFAOffset = StackOffset::get(MBBI->getOperand(2).getImm(),
+ MBBI->getOperand(3).getImm());
+
+ MachineBasicBlock::iterator NextInst =
+ inlineStackProbeFixed(MBBI, ScratchReg, FrameSize, CFAOffset);
+
+ MBBI->eraseFromParent();
+ return NextInst;
+}
+
+MachineBasicBlock::iterator AArch64FrameLowering::inlineStackProbeVar(
+ MachineBasicBlock::iterator MBBI) const {
+ MachineBasicBlock &MBB = *MBBI->getParent();
+ MachineFunction &MF = *MBB.getParent();
+ const AArch64InstrInfo *TII =
+ MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+
+ DebugLoc DL = MBB.findDebugLoc(MBBI);
+ Register ScratchReg = MBBI->getOperand(0).getReg();
+ Register TargetReg = MBBI->getOperand(1).getReg();
+
+ MachineBasicBlock::iterator NextInst =
+ TII->insertStac...
[truncated]
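To make the first two probing strategies concrete, here is an illustrative sketch of the kind of code they produce, assuming a 4KiB probe size. The frame sizes, the label, and the choice of x9 as scratch register are assumptions made for this example, not output copied from the patch or its tests:

    // Straight-line probing for a fixed 18KiB frame (< 5 guard pages):
    sub  sp, sp, #4096        // allocate one probe interval...
    str  xzr, [sp]            // ...and probe it
    sub  sp, sp, #4096
    str  xzr, [sp]
    sub  sp, sp, #4096
    str  xzr, [sp]
    sub  sp, sp, #4096
    str  xzr, [sp]
    sub  sp, sp, #2048        // residual allocation
    str  xzr, [sp]            // residual probe (emitted when residual > 1024)

    // Loop for a larger compile-time-constant frame, e.g. 256KiB:
    sub  x9, sp, #64, lsl #12 // x9 = SP - 256KiB, the target value of SP
    .Lloop:
    sub  sp, sp, #4096        // allocate one page...
    str  xzr, [sp]            // ...and probe it
    cmp  sp, x9
    b.ne .Lloop               // until SP reaches the target

The loop body matches the SUB/STR/CMP/B.NE sequence built by inlineStackProbeLoopExactMultiple in the diff above.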
(force-pushed 5e25a2f to a36e4b0)
I'm still waiting for some reply to https://reviews.llvm.org/D158084#4593014 before I continue reviewing.
(force-pushed a36e4b0 to 6f714da)
I've now refactored the code so that the part that determines the probe size and whether stack probing is enabled is shared between Windows and non-Windows.
(force-pushed 6f714da to 7b66d75)
✅ With the latest revision this PR passed the C/C++ code formatter.
(force-pushed 7b66d75 to 13285e8)
Ping?
I haven't been able to produce a minimal, sharable example as of yet, but I'm encountering a runtime error associated with an inlined function where stack probing is active. The error manifests as a null pointer dereference, originating from a stack value that is probed (and set to 0) before being subsequently dereferenced. The IR contributing to this runtime issue is somewhat complex and challenging to interpret, but here are my observations:
I'm working to isolate this issue and will share a repro ASAP. In the meantime, any insights or suggestions based on this description would be greatly appreciated. Also, is it required to write to the value? Would reading the value be sufficient?
All the stack probing should have already finished before the call to
Just to make things simpler, can you try disabling the shrink-wrapping and see what happens?
I can't really see a compelling reason to prefer one over another. Maybe for the odd chance some
Anyway, changing that won't solve the problem here, at most it could hide it.
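For reference, shrink-wrapping can typically be turned off via an LLVM cl::opt; a possible invocation (assuming the clang driver; the -enable-shrink-wrap option exists in upstream LLVM, though a backported toolchain may differ):

    clang -O2 -mllvm -enable-shrink-wrap=false test.c -c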
Can you spot a place where the probe instruction is not immediately after a decrement of the stack (disregarding some random register-to-register arithmetic that may appear)? If you can't (and you should not find such a case), perhaps some instruction writes below the
Is redzone enabled (
The stack probes (except
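For reference, the decrement-then-probe pattern described above looks like this (an illustrative sketch with assumed offsets, not compiler output):

    sub  sp, sp, #4096       // SP decrement...
    str  xzr, [sp]           // ...immediately followed by its probe
    // By contrast, something like `stur x0, [sp, #-16]` writes below SP
    // and relies on a red zone, which the probing scheme assumes is absent.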
(force-pushed 13285e8 to 1ba57e1)
Testing this patch set on a complex application (including later PRs) yielded no issues :)
Thank you for your work on this, I appreciate it!
I'm going to squash the commits which belong to this PR as I don't believe they are useful in isolation anymore and they get in the way of refactoring/rebasing. (Long story short, I did a patch to avoid having two back-to-back probing loops, then factored out a stack-probing-independent part (to come in a separate review later), and rebasing this PR on top of it is a major pain.)
(force-pushed c8c0bf5 to bbc0152)
In the last update:
(force-pushed b1d9154 to ff16f79)
Latest update adds this patch: ff16f79
(force-pushed ff16f79 to 60e918b)
I only now noticed I had a bunch of comments/replies sitting for a few weeks in "Pending" state :/
(force-pushed 60e918b to 1170c02)
LGTM
(force-pushed 1170c02 to e57df7b)
This adds code to AArch64 function prologues to protect against stack clash attacks by probing (writing to) the stack at regular enough intervals to ensure that the guard page cannot be skipped over.

The patch depends on and maintains the following invariants:

* Upon function entry the caller guarantees that it has probed the stack (e.g. performed a store) at some address [sp, #N], where `0 <= N <= 1024`. This invariant comes from a requirement for compatibility with GCC.
* Any address range in the allocated stack, no smaller than stack-probe-size bytes, contains at least one probe.
* At any time the stack pointer is above or in the guard page.
* Probes are performed in decreasing address order.

The `stack-probe-size` is a function attribute that can be set by a platform to correspond to the guard page size. By default, the stack probe size is 4KiB, which is a safe default as this is the smallest possible page size for AArch64. Linux uses a 64KiB guard for AArch64, so this can be overridden by the `stack-probe-size` function attribute.

For small frames without a frame pointer (<= 240 bytes), no probes are needed.

For larger frame sizes, LLVM always stores `x29` to the stack. This serves as an implicit stack probe. Thus, while allocating stack objects the compiler assumes that the stack has been probed at `[sp]`.

There are multiple probing sequences that can be emitted, depending on the size of the stack allocation:

* A straight-line sequence of subtracts and stores, used when the allocation size is smaller than 5 guard pages.
* A loop allocating and probing one page size per iteration, plus at most a single probe to deal with the remainder, used when the allocation size is larger but still known at compile time.
* A loop which moves the SP down to the target value held in a register, used when the allocation size is not known at compile-time, such as when allocating space for SVE values, or when over-aligning the stack. This is emitted in AArch64InstrInfo because it will also be used for dynamic allocas in a future patch.
* A single probe where the amount of stack adjustment is unknown, but is known to be less than or equal to the page size.

Change-Id: Ib31bc23d2806835bc6da822599f2bd2ecc18e4bf
Co-authored-by: Oliver Stannard <[email protected]>
…bed bytes at top of stack
In some cases a probe may be elided, resulting in more than 1024 unprobed bytes at the top of the stack, or a decrement of the SP by more than a guard area size.
(force-pushed e57df7b to e28870b)
// STR XZR, [SP]
BuildMI(*ExitMBB, ExitMBB->end(), DL, TII->get(AArch64::STRXui))
    .addReg(AArch64::XZR)
    .addReg(AArch64::SP)
    .addImm(0)
    .setMIFlags(Flags);
Hey @momchil-velikov, I'm encountering an issue where the stack probe is overwriting stack memory, similar to the issue I faced before. I apologize for not addressing this earlier, but backporting these changes introduced significant overhead.
The issue seems to be caused by this final stack probe instruction that is overwriting a stack value. I'm considering removing this probe instruction as a potential solution.
I don't anticipate this change will compromise security. Given our current stack probing strategy, we're always within one page of the most recent probe. Therefore, any subsequent instructions accessing memory at [sp] or above will either be valid or trigger a guard page fault.
I can't definitively confirm whether this issue is a result of backporting or an inherent problem until I upgrade to LLVM 18, which is some ways away.
Please let me know if my understanding is incorrect. I appreciate your work on this patchset and thank you for your assistance! :)
Assuming TargetReg is lower than SP on entry at "LoopTest", the final store must be to an address lower than SP on entry to the sequence. (And since SP is always 16-byte aligned, it's impossible to have an issue with partial overlap.) Since that's freshly allocated memory, nothing should care what it contains.
I guess maybe weird things could happen if something tries to allocate 0 bytes of memory? Probably something that needs to be fixed, but it's unlikely you'd run into it from C code.
Also, maybe worth checking your code isn't depending on a value of an uninitialized variable.
Removing the final store completely breaks the guarantees the code is supposed to provide: if the allocation is less than the page size, the only store is the final store. So if the compiled code, for example, calls "alloca(1024)" in a loop, you can skip over an arbitrary number of pages.
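A sketch of that failure mode, assuming a 4KiB guard page and `alloca(1024)` in a loop (illustrative, not compiler output):

    .Lalloca_loop:
    sub  sp, sp, #1024        // alloca(1024) with the final probe removed
    b    .Lalloca_loop
    // After four iterations SP has moved a full 4KiB with no memory access
    // at all, so a guard page can be stepped over without ever faulting.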
Thanks for that explanation, yes I see it leaves a large gap. There's like a million optimizations and mitigations slapped on this code, so it gets difficult to understand the root cause.
I guess maybe weird things could happen if something tries to allocate 0 bytes of memory? Probably something that needs to be fixed, but it's unlikely you'd run into it from C code.
Thanks for this, it made me look into it a bit more and we're not allocating 0 bytes, but we are allocating 0x10 bytes, which might cause some issues. I appreciate your quick reply, thank you again!
EDIT: We are allocating 0 bytes lol, I read it wrong. OK, so that's the issue.
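In assembly terms, with a 0-byte variable-length allocation the target register already equals SP on entry, so the probing loop never allocates anything and the trailing probe lands on memory that was not freshly allocated. A rough sketch (register choice illustrative):

    mov  x15, sp              // target = SP - 0 == SP
    // the probing loop body never runs, since SP already equals x15
    str  xzr, [sp]            // trailing probe clobbers live data at [sp]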
Addressed this in #74806