Skip to content

feat: add memory pressure monitor with launchd integration#2792

Closed
johnwaldo wants to merge 2 commits intomarcusquinn:mainfrom
johnwaldo:feature/memory-pressure-monitor
Closed

feat: add memory pressure monitor with launchd integration#2792
johnwaldo wants to merge 2 commits intomarcusquinn:mainfrom
johnwaldo:feature/memory-pressure-monitor

Conversation

@johnwaldo
Copy link
Contributor

@johnwaldo johnwaldo commented Mar 3, 2026

Summary

  • Adds memory-pressure-monitor.sh — early warning system for macOS memory pressure
  • Monitors kern.memorystatus_level, swap file count, and compressor segments
  • Sends desktop notifications (via terminal-notifier or osascript) at warning (<=40%) and critical (<=20%) thresholds
  • Includes top memory consumers in alerts and logs for quick triage

Motivation

Kernel panic on March 3 caused by memory exhaustion cascade: 46 swap files, 100% compressor segment limit, watchdog timeout after 94 seconds. The system had no early warning — by the time pressure was visible, it was too late.

Features

Command Purpose
--check Single check (default, for launchd)
--status Print current memory state and monitor status
--daemon Continuous monitoring (30s interval)
--install Install launchd plist (sh.aidevops.memory-pressure-monitor)
--uninstall Remove launchd plist and state files

Design

  • Primary signal: kern.memorystatus_level (kernel-maintained, instant read, 0-100)
  • Cooldown: 5-minute notification cooldown per severity level to avoid spam
  • Recovery tracking: Logs when pressure resolves after an alert
  • Low overhead: ProcessType: Background, Nice: 10, LowPriorityBackgroundIO: true in launchd plist
  • Zero dependencies: Uses only macOS builtins + optional terminal-notifier

Testing

  • ShellCheck clean
  • Tested: --status, --check (normal path), --check (triggered warning/critical via threshold override), --help
  • Verified: log output, notification delivery, cooldown behavior, recovery detection

Summary by CodeRabbit

  • New Features
    • Added a macOS memory pressure monitor that provides early warnings and status reporting
    • Three modes: single check, continuous daemon, and status view
    • Install/uninstall support for a background agent to run periodic checks
    • Configurable thresholds, cooldowns, log paths, and notification toggles
    • Detailed logging and human-readable summaries of memory state and top consumers
    • Sends alerts via preferred notifier with a fallback notification method

Early warning system for macOS memory pressure that monitors
kern.memorystatus_level, swap state, and compressor segments.
Sends desktop notifications at warning (<=40%) and critical (<=20%)
thresholds with top memory consumers. Includes --install for launchd
(30s interval), --daemon mode, and --status for quick checks.

Motivated by a kernel panic caused by memory exhaustion cascade
(46 swap files, 100% compressor segment limit) under heavy
Electron + opencode workloads.
@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical system monitoring tool for macOS, aimed at preventing kernel panics caused by memory exhaustion. By proactively tracking memory pressure indicators and providing timely, detailed notifications, it empowers users to intervene before system stability is compromised. The integration with launchd ensures continuous, efficient background operation, making it a robust solution for maintaining system health.

Highlights

  • New Memory Pressure Monitor: Introduced memory-pressure-monitor.sh, a new shell script designed to act as an early warning system for macOS memory pressure.
  • Comprehensive Monitoring: The script monitors key macOS memory metrics including kern.memorystatus_level, swap file count, and compressor segments to detect impending memory exhaustion.
  • Threshold-Based Notifications: Desktop notifications are sent at configurable warning (<=40% free memory) and critical (<=20% free memory) thresholds, utilizing terminal-notifier or osascript.
  • Diagnostic Information: Alerts and logs include a list of top memory-consuming processes to aid in quick diagnosis and resolution.
  • Launchd Integration: The script includes commands to install and uninstall a launchd plist, enabling continuous, low-overhead background monitoring at 30-second intervals.
  • Low Overhead and Dependencies: Designed for minimal system impact with ProcessType: Background, Nice: 10, and LowPriorityBackgroundIO: true in its launchd configuration, and uses only macOS built-ins with terminal-notifier as an optional enhancement.
Changelog
  • .agents/scripts/memory-pressure-monitor.sh
    • Added a new bash script to monitor macOS memory pressure.
    • Implemented logic for warning and critical memory thresholds.
    • Integrated desktop notification capabilities using terminal-notifier or osascript.
    • Included functionality to log and display top memory-consuming processes.
    • Developed commands for launchd installation, uninstallation, status checks, and daemon mode.
Activity
  • No activity has occurred on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 3, 2026

Walkthrough

Adds a new Bash agent script that monitors macOS memory pressure, swap and VM compressor state. Provides single-check, daemonized monitoring (LaunchAgent), and status modes; emits notifications (terminal-notifier or osascript) with per-level cooldowns, logging, install/uninstall of a LaunchAgent plist, and configurable thresholds.

Changes

Cohort / File(s) Summary
Memory Pressure Monitoring Agent
.agents/scripts/memory-pressure-monitor.sh
New comprehensive Bash script: collects kern.memorystatus_level, swap and compressor stats; supports check/monitor/status modes, install/uninstall of a LaunchAgent, cooldowned notifications, logging of top memory consumers, configurable thresholds and paths.

Sequence Diagram(s)

sequenceDiagram
    participant Launchd as Launchd Scheduler
    participant Script as memory-pressure-monitor.sh
    participant Kernel as macOS Kernel
    participant Notifier as Notification Agent
    participant Logger as Log File

    Launchd->>Script: Run (daemon mode / schedule)
    Script->>Kernel: Query kern.memorystatus_level, swap, compressor, VM stats
    Kernel-->>Script: Return memory metrics
    Script->>Script: Evaluate thresholds & cooldowns
    alt Threshold exceeded & cooldown expired
        Script->>Notifier: Send notification (terminal-notifier or osascript)
        Notifier-->>Script: Delivery result
        Script->>Script: Update cooldown timestamp
    end
    Script->>Logger: Append metrics and top consumers
    Logger-->>Script: Confirm write
    Script-->>Launchd: Exit/status
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🧠 A tiny guardian watches RAM's deep sea,
It counts the swaps and notes compressor plea,
Launchd hums steady while alerts softly sing,
Cooldowns keep the clatter — peace monitoring.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a new memory pressure monitor script with launchd integration, which is the primary focus of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
.agents/scripts/memory-pressure-monitor.sh (1)

289-298: Daemon mode works but could benefit from signal handling for cleaner shutdown.

The infinite loop is functional, but adding a trap for SIGTERM/SIGINT would improve observability by logging when the daemon stops. This is optional since launchd manages the lifecycle anyway.

✨ Optional: Add signal trap for graceful shutdown logging
 cmd_daemon() {
 	echo "[${SCRIPT_NAME}] Starting daemon mode (interval: ${DAEMON_INTERVAL}s)"
 	echo "[${SCRIPT_NAME}] Thresholds: warn=${WARN_THRESHOLD}%, crit=${CRIT_THRESHOLD}%"
 	echo "[${SCRIPT_NAME}] Press Ctrl+C to stop"
+	trap 'echo "[${SCRIPT_NAME}] Shutting down"; exit 0' SIGTERM SIGINT

 	while true; do
 		do_check
 		sleep "${DAEMON_INTERVAL}"
 	done
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/memory-pressure-monitor.sh around lines 289 - 298, Add a
signal trap in cmd_daemon to handle SIGINT and SIGTERM so the daemon logs a
graceful shutdown and exits cleanly: inside cmd_daemon (which currently contains
the infinite while true loop calling do_check and sleeping DAEMON_INTERVAL), set
trap handlers for SIGINT and SIGTERM that echo a shutdown message including
SCRIPT_NAME and then exit, and ensure the trap is installed before entering the
loop so the daemon logs when it stops.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 116-120: get_memory_level currently returns "0" on sysctl failure
which is indistinguishable from an actual critical reading; change
get_memory_level to capture the sysctl output and exit status, validate the
output is a numeric 0-100 value (e.g., via a regex or awk), and only echo the
numeric value when valid; on failure or invalid output echo an empty string (or
a sentinel like "unknown") instead of "0" and optionally emit a warning to
stderr so callers can treat non-numeric/empty as "unavailable" rather than
critical.

---

Nitpick comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 289-298: Add a signal trap in cmd_daemon to handle SIGINT and
SIGTERM so the daemon logs a graceful shutdown and exits cleanly: inside
cmd_daemon (which currently contains the infinite while true loop calling
do_check and sleeping DAEMON_INTERVAL), set trap handlers for SIGINT and SIGTERM
that echo a shutdown message including SCRIPT_NAME and then exit, and ensure the
trap is installed before entering the loop so the daemon logs when it stops.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72e40ca and 338513e.

📒 Files selected for processing (1)
  • .agents/scripts/memory-pressure-monitor.sh

Comment on lines +116 to +120
get_memory_level() {
# kern.memorystatus_level: 0-100, percentage of memory considered "free"
# by the kernel (includes purgeable, inactive, free pages)
sysctl -n kern.memorystatus_level 2>/dev/null || echo "0"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fallback to "0" on sysctl failure could trigger false critical alerts.

If sysctl -n kern.memorystatus_level fails (permissions, kernel issue, etc.), returning "0" will be interpreted as critical memory pressure, causing spurious notifications. Consider differentiating between failure and actual low memory.

🛡️ Proposed fix to handle sysctl failure gracefully
 get_memory_level() {
 	# kern.memorystatus_level: 0-100, percentage of memory considered "free"
 	# by the kernel (includes purgeable, inactive, free pages)
-	sysctl -n kern.memorystatus_level 2>/dev/null || echo "0"
+	local level
+	if ! level="$(sysctl -n kern.memorystatus_level 2>/dev/null)" || [[ -z "${level}" ]]; then
+		log_msg "ERROR" "Failed to read kern.memorystatus_level"
+		echo "100"  # Fail safe: assume normal to avoid false alerts
+		return 1
+	fi
+	echo "${level}"
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/memory-pressure-monitor.sh around lines 116 - 120,
get_memory_level currently returns "0" on sysctl failure which is
indistinguishable from an actual critical reading; change get_memory_level to
capture the sysctl output and exit status, validate the output is a numeric
0-100 value (e.g., via a regex or awk), and only echo the numeric value when
valid; on failure or invalid output echo an empty string (or a sentinel like
"unknown") instead of "0" and optionally emit a warning to stderr so callers can
treat non-numeric/empty as "unavailable" rather than critical.

Halves launchd process churn with no practical loss in detection time.
The memory pressure cascade that caused the kernel panic took minutes
to develop — 60s granularity is more than sufficient.
@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a well-designed and valuable memory pressure monitor for macOS. However, it contains several security vulnerabilities, including command injection via bash arithmetic evaluation, XML injection in the generated launchd plist, and AppleScript injection in the notification function. Additionally, the script frequently suppresses command errors using 2>/dev/null, which hinders debugging and goes against project rules for shell scripting. These issues should be addressed by validating environment variables, properly escaping data, and ensuring errors are visible for diagnostics.


# Fallback: osascript
if command -v osascript &>/dev/null; then
osascript -e "display notification \"${message}\" with title \"Memory Pressure: ${title}\"" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The notify function is vulnerable to AppleScript injection because message and title are passed to osascript without proper escaping, allowing arbitrary commands if they contain double quotes. The provided suggestion addresses the injection and removes the 2>/dev/null suppression, ensuring potential osascript errors are visible for troubleshooting, in line with our guidelines.

Suggested change
osascript -e "display notification \"${message}\" with title \"Memory Pressure: ${title}\"" 2>/dev/null || true
local escaped_message="${message//"/\"}"
local escaped_title="${title//"/\"}"
osascript -e "display notification \"${escaped_message}\" with title \"Memory Pressure: ${escaped_title}\"" || true
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines +32 to +34
readonly WARN_THRESHOLD="${MEMORY_WARN_THRESHOLD:-40}"
readonly CRIT_THRESHOLD="${MEMORY_CRIT_THRESHOLD:-20}"
readonly COOLDOWN_SECS="${MEMORY_COOLDOWN_SECS:-300}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The script uses environment variables (MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, MEMORY_COOLDOWN_SECS) directly in bash arithmetic evaluations (e.g., [[ ... -le ... ]] on lines 187 and 189, and $(( ... )) on line 101). Bash performs arithmetic expansion on these operands, which can lead to arbitrary command execution if the variables contain malicious expressions like a[$(id)]. It is recommended to validate that these variables contain only numeric values before assigning them to the readonly variables.

Suggested change
readonly WARN_THRESHOLD="${MEMORY_WARN_THRESHOLD:-40}"
readonly CRIT_THRESHOLD="${MEMORY_CRIT_THRESHOLD:-20}"
readonly COOLDOWN_SECS="${MEMORY_COOLDOWN_SECS:-300}"
readonly WARN_THRESHOLD=$(echo "${MEMORY_WARN_THRESHOLD:-40}" | grep -E '^[0-9]+$' || echo 40)
readonly CRIT_THRESHOLD=$(echo "${MEMORY_CRIT_THRESHOLD:-20}" | grep -E '^[0-9]+$' || echo 20)
readonly COOLDOWN_SECS=$(echo "${MEMORY_COOLDOWN_SECS:-300}" | grep -E '^[0-9]+$' || echo 300)

readonly NOTIFY_ENABLED="${MEMORY_NOTIFY:-true}"
readonly DAEMON_INTERVAL=60

readonly LOG_DIR="${MEMORY_LOG_DIR:-${HOME}/.aidevops/logs}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The LOG_DIR variable, derived from the MEMORY_LOG_DIR environment variable, is used to construct file paths and is embedded directly into a launchd plist XML template on line 325. An attacker could provide a value containing XML tags (e.g., </string><key>ProgramArguments</key>...) to alter the structure of the plist and execute arbitrary commands when the agent is started by launchd. It is recommended to sanitize the input to remove characters that could break the XML structure.

Suggested change
readonly LOG_DIR="${MEMORY_LOG_DIR:-${HOME}/.aidevops/logs}"
readonly LOG_DIR=$(echo "${MEMORY_LOG_DIR:-${HOME}/.aidevops/logs}" | tr -d '<>&')

-message "${message}" \
-sound "${sound}" \
-group "${SCRIPT_NAME}" \
-sender "com.apple.ActivityMonitor" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of 2>/dev/null here suppresses all error output from terminal-notifier. While || true correctly prevents the script from exiting on failure, hiding the error message makes it difficult to diagnose why notifications might not be working (e.g., a misconfigured terminal-notifier, permission issues). In accordance with our general rule to avoid blanket error suppression, it's better to let error messages be visible for debugging.

Suggested change
-sender "com.apple.ActivityMonitor" 2>/dev/null || true
-sender "com.apple.ActivityMonitor" || true
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

local cooldown_file="$1"
if [[ -f "${cooldown_file}" ]]; then
local last_notify
last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This use of 2>/dev/null violates our rule against suppressing errors on file operations when the file's existence has already been checked. The [[ -f ... ]] check on line 96 makes this redundant for "file not found" errors, and it can mask other important issues like file permission problems. The || echo 0 fallback is sufficient to handle failures gracefully.

Suggested change
last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"
last_notify="$(cat "${cooldown_file}" || echo 0)"
References
  1. Avoid using 2>/dev/null to suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g., [[ -f "$file" ]] or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.

get_memory_level() {
# kern.memorystatus_level: 0-100, percentage of memory considered "free"
# by the kernel (includes purgeable, inactive, free pages)
sysctl -n kern.memorystatus_level 2>/dev/null || echo "0"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Suppressing stderr with 2>/dev/null here hides potentially important diagnostic information if sysctl fails (e.g., due to permissions or if the kernel key changes in a future macOS version). The || echo "0" construct already ensures the script continues with a default value, so the error suppression is not needed for script stability and is detrimental to debuggability, per our project guidelines. This same issue applies to lines 124, 131, and 139.

Suggested change
sysctl -n kern.memorystatus_level 2>/dev/null || echo "0"
sysctl -n kern.memorystatus_level || echo "0"
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.


get_swap_file_count() {
local count
count="$(find /private/var/vm -name 'swapfile*' 2>/dev/null | wc -l | tr -d ' ')"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using 2>/dev/null with find can hide important errors, such as "Permission denied" when traversing /private/var/vm. As per our guidelines, such errors should remain visible for debugging. Since stderr is separate from stdout, error messages will not interfere with the wc -l pipe, so there is no need to suppress them.

Suggested change
count="$(find /private/var/vm -name 'swapfile*' 2>/dev/null | wc -l | tr -d ' ')"
count="$(find /private/var/vm -name 'swapfile*' | wc -l | tr -d ' ')"
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

# ps output: RSS (KB), %MEM, COMMAND — sorted by RSS descending
# Use process substitution to avoid SIGPIPE from head closing the pipe
local ps_output
ps_output="$(ps -eo rss=,pmem=,comm= -r 2>/dev/null | head -n "${count}" || true)"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Blanket error suppression with 2>/dev/null on the ps command should be avoided. If ps fails, it's important to see the error to understand why the script isn't reporting top memory consumers. The || true at the end of the pipeline already prevents the script from exiting due to set -e, so removing the redirection is safe and improves debuggability.

Suggested change
ps_output="$(ps -eo rss=,pmem=,comm= -r 2>/dev/null | head -n "${count}" || true)"
ps_output="$(ps -eo rss=,pmem=,comm= -r | head -n "${count}" || true)"
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

</plist>
PLIST

launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The 2>/dev/null here suppresses all errors from launchctl bootout. While it's intended to ignore the "service not found" error when the agent isn't installed, it will also hide other potentially critical errors (e.g., permission issues talking to launchd). The || true is sufficient to prevent the script from exiting. Letting errors be visible makes the script more robust and easier to debug. A similar issue exists on line 360.

Suggested change
launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true
launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" || true
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
.agents/scripts/memory-pressure-monitor.sh (1)

116-120: ⚠️ Potential issue | 🟠 Major

Avoid mapping telemetry read failures to a critical memory value

Line 119 returning "0" on sysctl failure is indistinguishable from genuine critical pressure and can trigger false critical alerts. Return an explicit “unavailable” sentinel and skip severity evaluation for that cycle.

Suggested fix
 get_memory_level() {
 	# kern.memorystatus_level: 0-100, percentage of memory considered "free"
 	# by the kernel (includes purgeable, inactive, free pages)
-	sysctl -n kern.memorystatus_level 2>/dev/null || echo "0"
+	local level
+	if ! level="$(sysctl -n kern.memorystatus_level 2>/dev/null)"; then
+		log_msg "WARN" "Unable to read kern.memorystatus_level"
+		echo ""
+		return 1
+	fi
+	if [[ ! "${level}" =~ ^[0-9]+$ ]] || (( level < 0 || level > 100 )); then
+		log_msg "WARN" "Invalid kern.memorystatus_level value: ${level}"
+		echo ""
+		return 1
+	fi
+	echo "${level}"
 }
# also guard in do_check before evaluate_state
level="$(echo "${state}" | cut -d'|' -f1)"
if [[ -z "${level}" ]]; then
  log_msg "WARN" "Memory level unavailable; skipping this check cycle"
  return 0
fi

As per coding guidelines, .agents/scripts/*.sh: Automation scripts should emphasize reliability/robustness and error recovery mechanisms.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/memory-pressure-monitor.sh around lines 116 - 120, The
get_memory_level function should not return "0" on sysctl failure because that
mimics critical memory pressure; change get_memory_level to return an explicit
empty/unavailable sentinel (e.g., empty string) on failure instead of "0", and
then update the caller (do_check before calling evaluate_state) to detect an
empty/unavailable level (e.g., level="$(echo "${state}" | cut -d'|' -f1)" and if
[[ -z "${level}" ]]; then log a WARN like "Memory level unavailable; skipping
this check cycle" and return 0) so severity evaluation is skipped for that
cycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 300-351: The installation currently writes the LaunchAgent plist
at ${PLIST_PATH} in cmd_install without ensuring its parent directory exists;
update cmd_install to create the LaunchAgents directory (mkdir -p "$(dirname
"${PLIST_PATH}")") before redirecting the heredoc, and ensure proper
permissions/ownership if required. Alternatively, add the LaunchAgents path
creation to ensure_dirs (alongside LOG_DIR and STATE_DIR) so the directory
exists for cmd_install; reference cmd_install, PLIST_PATH, ensure_dirs, LOG_DIR
and STATE_DIR when making the change.
- Around line 94-103: The check_cooldown function reads a timestamp from the
cooldown file into last_notify and then performs arithmetic with now and
COOLDOWN_SECS without validating last_notify; add numeric validation and a safe
fallback so corrupted/non-numeric file contents won't break the script.
Specifically, after reading last_notify in check_cooldown, validate it is an
integer (e.g., with a shell regex like '^[0-9]+$' or a POSIX-safe test) and if
invalid or empty set last_notify=0 before computing elapsed, then continue the
existing cooldown comparison using COOLDOWN_SECS; keep variable names
last_notify, now, elapsed and the return semantics unchanged.
- Around line 31-35: Add a new validate_config() function and call it at the
start of main() to ensure MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, and
MEMORY_COOLDOWN_SECS are valid numbers: parse WARN_THRESHOLD and CRIT_THRESHOLD
as integers, ensure they are numeric and within 0–100, enforce WARN_THRESHOLD >
CRIT_THRESHOLD, and validate COOLDOWN_SECS is a non-negative integer; on any
failure, emit a clear error via SCRIPT_NAME and exit non-zero. Update the
existing numeric comparisons that use WARN_THRESHOLD, CRIT_THRESHOLD, and
COOLDOWN_SECS by relying on the validated/normalized integer variables produced
by validate_config() so string values won’t be implicitly coerced. Ensure
validate_config() handles missing env vars by falling back to the existing
defaults, and logs precise error messages before exiting to aid debugging.

---

Duplicate comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 116-120: The get_memory_level function should not return "0" on
sysctl failure because that mimics critical memory pressure; change
get_memory_level to return an explicit empty/unavailable sentinel (e.g., empty
string) on failure instead of "0", and then update the caller (do_check before
calling evaluate_state) to detect an empty/unavailable level (e.g.,
level="$(echo "${state}" | cut -d'|' -f1)" and if [[ -z "${level}" ]]; then log
a WARN like "Memory level unavailable; skipping this check cycle" and return 0)
so severity evaluation is skipped for that cycle.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 338513e and 75220c1.

📒 Files selected for processing (1)
  • .agents/scripts/memory-pressure-monitor.sh

Comment on lines +31 to +35
readonly SCRIPT_NAME="memory-pressure-monitor"
readonly WARN_THRESHOLD="${MEMORY_WARN_THRESHOLD:-40}"
readonly CRIT_THRESHOLD="${MEMORY_CRIT_THRESHOLD:-20}"
readonly COOLDOWN_SECS="${MEMORY_COOLDOWN_SECS:-300}"
readonly NOTIFY_ENABLED="${MEMORY_NOTIFY:-true}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "memory-pressure-monitor.sh" -type f

Repository: marcusquinn/aidevops

Length of output: 108


🏁 Script executed:

cat -n .agents/scripts/memory-pressure-monitor.sh

Repository: marcusquinn/aidevops

Length of output: 15243


🏁 Script executed:

#!/bin/bash

# First check if shellcheck is available, install if not
if ! command -v shellcheck &>/dev/null; then
    echo "Installing shellcheck..."
    apt-get update && apt-get install -y shellcheck 2>/dev/null || brew install shellcheck
fi

# Run ShellCheck
shellcheck -S warning .agents/scripts/memory-pressure-monitor.sh || true

Repository: marcusquinn/aidevops

Length of output: 46


🏁 Script executed:

#!/bin/bash

# Test what happens with non-numeric threshold values
echo "=== Testing non-numeric threshold behavior ==="

# Simulate the script's behavior with invalid threshold
WARN_THRESHOLD="abc"
CRIT_THRESHOLD="20"
COOLDOWN_SECS="300"

level=35

echo "Test 1: Non-numeric WARN_THRESHOLD='abc', level=35"
if [[ ${level} -le ${CRIT_THRESHOLD} ]]; then
    echo "  Result: critical"
elif [[ ${level} -le ${WARN_THRESHOLD} ]]; then
    echo "  Result: warning"
else
    echo "  Result: normal"
fi 2>&1 || echo "  ERROR: Arithmetic comparison failed"

echo ""
echo "Test 2: Inverted thresholds (WARN=20, CRIT=40, level=35)"
WARN_THRESHOLD=20
CRIT_THRESHOLD=40
if [[ ${level} -le ${CRIT_THRESHOLD} ]]; then
    echo "  Result: critical (WRONG!)"
elif [[ ${level} -le ${WARN_THRESHOLD} ]]; then
    echo "  Result: warning"
else
    echo "  Result: normal"
fi

echo ""
echo "Test 3: Non-numeric COOLDOWN_SECS behavior"
COOLDOWN_SECS="not_a_number"
elapsed=100
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
    echo "  Result: Still in cooldown"
else
    echo "  Result: Cooldown expired"
fi 2>&1 || echo "  ERROR: Arithmetic comparison failed"

Repository: marcusquinn/aidevops

Length of output: 335


🏁 Script executed:

#!/bin/bash

# Now test the suggested validation function logic
echo "=== Testing suggested validation function ==="

validate_config() {
	for v in WARN_THRESHOLD CRIT_THRESHOLD COOLDOWN_SECS; do
		if [[ ! "${!v}" =~ ^[0-9]+$ ]]; then
			echo "Invalid ${v}: ${!v} (must be numeric)" >&2
			return 1
		fi
	done
	if (( WARN_THRESHOLD <= CRIT_THRESHOLD )); then
		echo "Invalid thresholds: warning (${WARN_THRESHOLD}) must be greater than critical (${CRIT_THRESHOLD})" >&2
		return 1
	fi
	if (( WARN_THRESHOLD > 100 || CRIT_THRESHOLD < 0 )); then
		echo "Thresholds must be within 0-100" >&2
		return 1
	fi
}

# Test valid config
echo "Test 1: Valid config (WARN=40, CRIT=20, COOLDOWN=300)"
WARN_THRESHOLD=40
CRIT_THRESHOLD=20
COOLDOWN_SECS=300
validate_config && echo "  ✓ Valid" || echo "  ✗ Invalid"

echo ""
echo "Test 2: Invalid - non-numeric WARN_THRESHOLD"
WARN_THRESHOLD=abc
CRIT_THRESHOLD=20
COOLDOWN_SECS=300
validate_config && echo "  ✓ Valid" || echo "  ✗ Invalid"

echo ""
echo "Test 3: Invalid - inverted thresholds (WARN=20, CRIT=40)"
WARN_THRESHOLD=20
CRIT_THRESHOLD=40
COOLDOWN_SECS=300
validate_config && echo "  ✓ Valid" || echo "  ✗ Invalid"

echo ""
echo "Test 4: Invalid - out of range (WARN=150, CRIT=20)"
WARN_THRESHOLD=150
CRIT_THRESHOLD=20
COOLDOWN_SECS=300
validate_config && echo "  ✓ Valid" || echo "  ✗ Invalid"

echo ""
echo "Test 5: Invalid - non-numeric COOLDOWN_SECS"
WARN_THRESHOLD=40
CRIT_THRESHOLD=20
COOLDOWN_SECS=notanumber
validate_config && echo "  ✓ Valid" || echo "  ✗ Invalid"

Repository: marcusquinn/aidevops

Length of output: 637


🏁 Script executed:

#!/bin/bash

# Verify the exact behavior: bash arithmetic and error handling with non-numeric values
echo "=== Understanding bash -le behavior with non-numeric strings ==="

echo "Test: bash -le with non-numeric string (no error exit)"
bash -c '
set -euo pipefail  # Script uses this
val="abc"
if [[ 35 -le $val ]]; then
    echo "True"
else
    echo "False"
fi
echo "Script continued (no error)"
'

echo ""
echo "Test: bash arithmetic with non-numeric in (( )) expression"
bash -c '
set -euo pipefail
val="abc"
if (( 35 < val )); then
    echo "True"
else
    echo "False (or coerced to 0)"
fi
' 2>&1 || echo "Script failed with exit code $?"

Repository: marcusquinn/aidevops

Length of output: 363


🏁 Script executed:

#!/bin/bash

# Final test: Verify actual silent failure behavior in the script's context
echo "=== Simulating actual script behavior with invalid inputs ==="

# Simulate evaluate_state with non-numeric threshold
evaluate_state() {
	local state="$1"
	local level total_gb swap_total swap_used compressor_segments swap_files
	IFS='|' read -r level total_gb swap_total swap_used compressor_segments swap_files <<<"${state}"

	if [[ ${level} -le ${CRIT_THRESHOLD} ]]; then
		echo "critical"
	elif [[ ${level} -le ${WARN_THRESHOLD} ]]; then
		echo "warning"
	else
		echo "normal"
	fi
}

# Test with non-numeric WARN_THRESHOLD
echo "Scenario 1: Non-numeric WARN_THRESHOLD (should warn at level=35)"
WARN_THRESHOLD="xyz"
CRIT_THRESHOLD=20
state="35|16.0|0|0|0|0"
result=$(evaluate_state "${state}")
echo "  Input: level=35, WARN_THRESHOLD='xyz', CRIT_THRESHOLD=20"
echo "  Result: ${result}"
echo "  Expected: warning (level between thresholds)"
echo "  ISSUE: Silent failure - returns ${result} (wrong!)"

echo ""
echo "Scenario 2: Inverted thresholds (WARN=20, CRIT=40, level=35)"
WARN_THRESHOLD=20
CRIT_THRESHOLD=40
result=$(evaluate_state "${state}")
echo "  Input: level=35, WARN_THRESHOLD=20, CRIT_THRESHOLD=40"
echo "  Result: ${result}"
echo "  Expected: warning (but thresholds are backwards)"
echo "  ISSUE: Returns critical when it should be warning!"

echo ""
echo "Scenario 3: Non-numeric COOLDOWN_SECS in cooldown check"
COOLDOWN_SECS="bad"
elapsed=100
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
    echo "  Result: Still in cooldown"
else
    echo "  Result: Cooldown expired (WRONG - should check numeric validity)"
fi

Repository: marcusquinn/aidevops

Length of output: 728


Add validation for threshold and cooldown configuration values

Environment variables MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, and MEMORY_COOLDOWN_SECS are consumed in numeric comparisons (lines 102, 187, 189) without validation. Non-numeric values silently coerce to incorrect results—for example, WARN_THRESHOLD="abc" causes level 35 to be misclassified as "normal" instead of "warning". Additionally, inverted thresholds (warning ≤ critical) are not caught, causing false critical alerts.

Add a validate_config() function that validates numeric format, enforces WARN_THRESHOLD > CRIT_THRESHOLD, and checks that both fall within 0–100. Call this function at the start of main() before any monitoring logic executes.

Suggested validation function
+validate_config() {
+	for v in WARN_THRESHOLD CRIT_THRESHOLD COOLDOWN_SECS; do
+		if [[ ! "${!v}" =~ ^[0-9]+$ ]]; then
+			echo "Invalid ${v}: ${!v} (must be numeric)" >&2
+			return 1
+		fi
+	done
+	if (( WARN_THRESHOLD <= CRIT_THRESHOLD )); then
+		echo "Invalid thresholds: warning (${WARN_THRESHOLD}) must be greater than critical (${CRIT_THRESHOLD})" >&2
+		return 1
+	fi
+	if (( WARN_THRESHOLD > 100 || CRIT_THRESHOLD < 0 )); then
+		echo "Thresholds must be within 0-100" >&2
+		return 1
+	fi
+}
 main() {
 	local cmd="${1:-check}"
+	validate_config || return 1

Per automation script guidelines: ensure reliability, robustness, and proper error handling with clear feedback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/memory-pressure-monitor.sh around lines 31 - 35, Add a new
validate_config() function and call it at the start of main() to ensure
MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, and MEMORY_COOLDOWN_SECS are valid
numbers: parse WARN_THRESHOLD and CRIT_THRESHOLD as integers, ensure they are
numeric and within 0–100, enforce WARN_THRESHOLD > CRIT_THRESHOLD, and validate
COOLDOWN_SECS is a non-negative integer; on any failure, emit a clear error via
SCRIPT_NAME and exit non-zero. Update the existing numeric comparisons that use
WARN_THRESHOLD, CRIT_THRESHOLD, and COOLDOWN_SECS by relying on the
validated/normalized integer variables produced by validate_config() so string
values won’t be implicitly coerced. Ensure validate_config() handles missing env
vars by falling back to the existing defaults, and logs precise error messages
before exiting to aid debugging.

Comment on lines +94 to +103
check_cooldown() {
local cooldown_file="$1"
if [[ -f "${cooldown_file}" ]]; then
local last_notify
last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"
local now
now="$(date +%s)"
local elapsed=$((now - last_notify))
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
return 1 # Still in cooldown
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -120 .agents/scripts/memory-pressure-monitor.sh | tail -40

Repository: marcusquinn/aidevops

Length of output: 1092


🏁 Script executed:

grep -n "set -" .agents/scripts/memory-pressure-monitor.sh

Repository: marcusquinn/aidevops

Length of output: 84


🏁 Script executed:

cat -n .agents/scripts/memory-pressure-monitor.sh | sed -n '90,110p'

Repository: marcusquinn/aidevops

Length of output: 638


🏁 Script executed:

# Check for shellcheck availability and run it on the script
if command -v shellcheck &>/dev/null; then
  shellcheck -S warning .agents/scripts/memory-pressure-monitor.sh 2>&1 || true
else
  echo "ShellCheck not available in sandbox"
fi

Repository: marcusquinn/aidevops

Length of output: 46


🏁 Script executed:

# Look for how cooldown file content is typically initialized
rg "cooldown_file" .agents/scripts/memory-pressure-monitor.sh -B 2 -A 2

Repository: marcusquinn/aidevops

Length of output: 343


🏁 Script executed:

# Check for numeric validation patterns elsewhere in the script
rg '\[0-9\]' .agents/scripts/memory-pressure-monitor.sh -B 1 -A 1

Repository: marcusquinn/aidevops

Length of output: 46


🏁 Script executed:

# Look for error handling patterns
rg "set -euo pipefail|trap|return 1" .agents/scripts/memory-pressure-monitor.sh | head -20

Repository: marcusquinn/aidevops

Length of output: 176


Validate cooldown timestamps before arithmetic to prevent script failure

The cooldown file content is not validated before use in arithmetic expansion on line 101. If the file becomes corrupted, the $((now - last_notify)) operation will fail under set -euo pipefail, causing script termination. The current fallback to echo 0 only handles read failures, not content validity.

Add numeric validation with graceful recovery:

Suggested fix
 check_cooldown() {
 	local cooldown_file="$1"
 	if [[ -f "${cooldown_file}" ]]; then
 		local last_notify
 		last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"
+		if [[ ! "${last_notify}" =~ ^[0-9]+$ ]]; then
+			log_msg "WARN" "Invalid cooldown value in ${cooldown_file}; resetting"
+			rm -f "${cooldown_file}"
+			return 0
+		fi
 		local now
 		now="$(date +%s)"
 		local elapsed=$((now - last_notify))
 		if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
 			return 1 # Still in cooldown

Per automation script guidelines: maintain reliability/robustness and robust error recovery mechanisms.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
check_cooldown() {
local cooldown_file="$1"
if [[ -f "${cooldown_file}" ]]; then
local last_notify
last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"
local now
now="$(date +%s)"
local elapsed=$((now - last_notify))
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
return 1 # Still in cooldown
check_cooldown() {
local cooldown_file="$1"
if [[ -f "${cooldown_file}" ]]; then
local last_notify
last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"
if [[ ! "${last_notify}" =~ ^[0-9]+$ ]]; then
log_msg "WARN" "Invalid cooldown value in ${cooldown_file}; resetting"
rm -f "${cooldown_file}"
return 0
fi
local now
now="$(date +%s)"
local elapsed=$((now - last_notify))
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
return 1 # Still in cooldown
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/memory-pressure-monitor.sh around lines 94 - 103, The
check_cooldown function reads a timestamp from the cooldown file into
last_notify and then performs arithmetic with now and COOLDOWN_SECS without
validating last_notify; add numeric validation and a safe fallback so
corrupted/non-numeric file contents won't break the script. Specifically, after
reading last_notify in check_cooldown, validate it is an integer (e.g., with a
shell regex like '^[0-9]+$' or a POSIX-safe test) and if invalid or empty set
last_notify=0 before computing elapsed, then continue the existing cooldown
comparison using COOLDOWN_SECS; keep variable names last_notify, now, elapsed
and the return semantics unchanged.

Comment on lines +300 to +351
cmd_install() {
local script_path
script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")"

# Verify the script exists at the expected installed location
local installed_path="${HOME}/.aidevops/agents/scripts/${SCRIPT_NAME}.sh"
if [[ -x "${installed_path}" ]]; then
script_path="${installed_path}"
fi

cat >"${PLIST_PATH}" <<PLIST
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>${LAUNCHD_LABEL}</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>${script_path}</string>
</array>
<key>StartInterval</key>
<integer>60</integer>
<key>StandardOutPath</key>
<string>${LOG_DIR}/memory-pressure-launchd.log</string>
<key>StandardErrorPath</key>
<string>${LOG_DIR}/memory-pressure-launchd.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
<key>HOME</key>
<string>${HOME}</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<false/>
<key>ProcessType</key>
<string>Background</string>
<key>LowPriorityBackgroundIO</key>
<true/>
<key>Nice</key>
<integer>10</integer>
</dict>
</plist>
PLIST

launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true
launchctl bootstrap "gui/$(id -u)" "${PLIST_PATH}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file size and get an overview
wc -l .agents/scripts/memory-pressure-monitor.sh

Repository: marcusquinn/aidevops

Length of output: 110


🏁 Script executed:

# Read the entire script to understand context
cat -n .agents/scripts/memory-pressure-monitor.sh

Repository: marcusquinn/aidevops

Length of output: 15243


🏁 Script executed:

# Run ShellCheck on the script
shellcheck .agents/scripts/memory-pressure-monitor.sh 2>&1 || true

Repository: marcusquinn/aidevops

Length of output: 46


Add directory creation to cmd_install before writing the LaunchAgent plist

The cmd_install function writes the plist file to ${PLIST_PATH} without ensuring the parent directory ~/Library/LaunchAgents/ exists. This causes bootstrap failures on fresh environments. Additionally, ensure_dirs() creates only ${LOG_DIR} and ${STATE_DIR}, but not the LaunchAgent directory.

Required fix
 cmd_install() {
+	ensure_dirs
+	mkdir -p "$(dirname "${PLIST_PATH}")"
+
 	local script_path
 	script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cmd_install() {
local script_path
script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")"
# Verify the script exists at the expected installed location
local installed_path="${HOME}/.aidevops/agents/scripts/${SCRIPT_NAME}.sh"
if [[ -x "${installed_path}" ]]; then
script_path="${installed_path}"
fi
cat >"${PLIST_PATH}" <<PLIST
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>${LAUNCHD_LABEL}</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>${script_path}</string>
</array>
<key>StartInterval</key>
<integer>60</integer>
<key>StandardOutPath</key>
<string>${LOG_DIR}/memory-pressure-launchd.log</string>
<key>StandardErrorPath</key>
<string>${LOG_DIR}/memory-pressure-launchd.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
<key>HOME</key>
<string>${HOME}</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<false/>
<key>ProcessType</key>
<string>Background</string>
<key>LowPriorityBackgroundIO</key>
<true/>
<key>Nice</key>
<integer>10</integer>
</dict>
</plist>
PLIST
launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true
launchctl bootstrap "gui/$(id -u)" "${PLIST_PATH}"
cmd_install() {
ensure_dirs
mkdir -p "$(dirname "${PLIST_PATH}")"
local script_path
script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")"
# Verify the script exists at the expected installed location
local installed_path="${HOME}/.aidevops/agents/scripts/${SCRIPT_NAME}.sh"
if [[ -x "${installed_path}" ]]; then
script_path="${installed_path}"
fi
cat >"${PLIST_PATH}" <<PLIST
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>${LAUNCHD_LABEL}</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>${script_path}</string>
</array>
<key>StartInterval</key>
<integer>60</integer>
<key>StandardOutPath</key>
<string>${LOG_DIR}/memory-pressure-launchd.log</string>
<key>StandardErrorPath</key>
<string>${LOG_DIR}/memory-pressure-launchd.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
<key>HOME</key>
<string>${HOME}</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<false/>
<key>ProcessType</key>
<string>Background</string>
<key>LowPriorityBackgroundIO</key>
<true/>
<key>Nice</key>
<integer>10</integer>
</dict>
</plist>
PLIST
launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true
launchctl bootstrap "gui/$(id -u)" "${PLIST_PATH}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/memory-pressure-monitor.sh around lines 300 - 351, The
installation currently writes the LaunchAgent plist at ${PLIST_PATH} in
cmd_install without ensuring its parent directory exists; update cmd_install to
create the LaunchAgents directory (mkdir -p "$(dirname "${PLIST_PATH}")") before
redirecting the heredoc, and ensure proper permissions/ownership if required.
Alternatively, add the LaunchAgents path creation to ensure_dirs (alongside
LOG_DIR and STATE_DIR) so the directory exists for cmd_install; reference
cmd_install, PLIST_PATH, ensure_dirs, LOG_DIR and STATE_DIR when making the
change.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

8 similar comments
@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

CI status: FAIL (SonarCloud analysis failing, auto-fix check failing)
Review status: CHANGES_REQUESTED (CodeRabbit flagged actionable issues)

Key issues from CodeRabbit review:

  1. get_memory_level returns "0" on sysctl failure — indistinguishable from genuine critical pressure, can trigger false alerts
  2. check_cooldown performs arithmetic on unvalidated timestamp from file — corrupted content could break the script
  3. cmd_install doesn't ensure parent directory exists before writing plist
  4. Missing validate_config() for env var validation

These should be addressed by the contributor before this can be merged.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

Review findings summary:

  • CodeRabbit (CHANGES_REQUESTED): get_memory_level returns "0" on sysctl failure (indistinguishable from critical pressure → false alerts). Fix: return empty sentinel, skip evaluation cycle.
  • CodeRabbit (nitpick): daemon mode lacks SIGTERM/SIGINT trap for graceful shutdown.
  • Gemini Code Assist (COMMENTED): security concerns — command injection via arithmetic eval, XML injection in launchd plist generation, AppleScript injection in notification function, excessive 2>/dev/null suppression.
  • CI: SonarCloud and 'Monitor & Auto-Fix Code Quality' checks failing.

The core feature (memory pressure monitoring) is valuable and well-motivated. The security issues flagged by Gemini need addressing before merge. Maintainer review required.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

3 similar comments
@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

Review findings summary:

  • CodeRabbit (CHANGES_REQUESTED): get_memory_level returns "0" on sysctl failure (indistinguishable from critical pressure — can trigger false alerts). Also: missing LaunchAgents dir creation in cmd_install, cooldown timestamp not validated, no validate_config() for env var inputs.
  • Gemini Code Assist: security concerns — command injection via bash arithmetic, XML injection in launchd plist generation, AppleScript injection in notification function, excessive 2>/dev/null suppression.
  • CI: SonarCloud and Monitor & Auto-Fix checks failing.

The feature itself is valuable (motivated by a real kernel panic). The issues above need to be addressed by the contributor before this can be merged.

@alex-solovyev
Copy link
Collaborator

This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually.

@alex-solovyev alex-solovyev added the external-contributor PR from external contributor, requires maintainer review label Mar 3, 2026
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
gh pr edit --add-label fails with GraphQL Projects deprecation warning
(exit 1), preventing the external-contributor label from being applied.
This broke the idempotency guard, causing duplicate comments on every
pulse cycle (17+ on PR #2792).

Replace both instances (external-contributor gate and orphaned PR
scanner) with gh api repos/<slug>/issues/<number>/labels REST endpoint
which works correctly.

Closes #2793
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
…sed on API errors

Root cause: the previous guard (#2796) used 2>&1 to capture stderr into the
output variable, but this meant API failures produced non-empty strings that
didn't match the grep patterns, causing the if/elif/else chain to fall through
to the 'post comment' branch on every pulse cycle (~2 min). PR #2792 accumulated
20+ identical comments from this bug.

Three structural changes:
1. Separate exit code from output: capture $? independently so API failures
   are detected by exit code, not by parsing error text
2. Fail closed: if either the label check or comment check API call fails
   (non-zero exit), skip posting entirely — the next pulse retries
3. Independent checks: check label AND comment as separate boolean flags,
   then use a 3-way decision (API error → skip, already flagged → skip/re-label,
   clean → post). The old if/elif/else structure conflated 'API error' with
   'not found'.

Same fix applied to the permission-failure comment guard (outcome 3).

Closes #2802
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
Root cause: the comment idempotency check used a pipe (gh pr view | grep)
which silently returned 'no match' when the API call failed, causing the
guard to fall through and post a duplicate comment every 2-minute pulse
cycle. This produced 18 duplicate comments on PR #2792.

Three fixes applied:
1. Fetch comments into a variable (like labels) instead of piping — so we
   can capture the exit code and detect API failures
2. Fail closed: if the comment fetch fails, skip posting (assume already
   flagged). The next pulse cycle retries when the API recovers
3. Replace 2>/dev/null with 2>&1 on label-add operations so failures are
   visible in logs

Same pattern applied to the permission-failure comment guard.

Closes #2802
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
…sed on API errors

Root cause: the previous guard (#2796) used 2>&1 to capture stderr into the
output variable, but this meant API failures produced non-empty strings that
didn't match the grep patterns, causing the if/elif/else chain to fall through
to the 'post comment' branch on every pulse cycle (~2 min). PR #2792 accumulated
20+ identical comments from this bug.

Three structural changes:
1. Separate exit code from output: capture $? independently so API failures
   are detected by exit code, not by parsing error text
2. Fail closed: if either the label check or comment check API call fails
   (non-zero exit), skip posting entirely — the next pulse retries
3. Independent checks: check label AND comment as separate boolean flags,
   then use a 3-way decision (API error → skip, already flagged → skip/re-label,
   clean → post). The old if/elif/else structure conflated 'API error' with
   'not found'.

Same fix applied to the permission-failure comment guard (outcome 3).

Closes #2802
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
…sed on API errors (#2803)

Root cause: the previous guard (#2796) used 2>&1 to capture stderr into the
output variable, but this meant API failures produced non-empty strings that
didn't match the grep patterns, causing the if/elif/else chain to fall through
to the 'post comment' branch on every pulse cycle (~2 min). PR #2792 accumulated
20+ identical comments from this bug.

Three structural changes:
1. Separate exit code from output: capture $? independently so API failures
   are detected by exit code, not by parsing error text
2. Fail closed: if either the label check or comment check API call fails
   (non-zero exit), skip posting entirely — the next pulse retries
3. Independent checks: check label AND comment as separate boolean flags,
   then use a 3-way decision (API error → skip, already flagged → skip/re-label,
   clean → post). The old if/elif/else structure conflated 'API error' with
   'not found'.

Same fix applied to the permission-failure comment guard (outcome 3).

Closes #2802
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
… function (t1391)

The LLM supervisor kept re-implementing the inline bash idempotency
guard incorrectly on each pulse cycle, causing 15+ duplicate
'external contributor' comments on PR #2792 despite 4 prior fix
attempts (PRs #2794, #2796, #2801, #2803) — all in pulse.md prompt
text.

Root cause: the check is deterministic (one correct answer regardless
of context) but was encoded as prompt guidance that the LLM had to
re-implement each cycle. Per the 'Intelligence Over Determinism'
principle, deterministic logic belongs in the harness, not the prompt.

Changes:
- Add check_external_contributor_pr() to pulse-wrapper.sh: checks
  BOTH label AND comment, captures exit codes separately, fails
  closed on any API error, only posts when confirmed safe
- Add check_permission_failure_pr() companion function for the
  permission API failure case
- Add source guard (BASH_SOURCE check) so pulse-wrapper.sh can be
  sourced for its functions without triggering the full lifecycle
- Update pulse.md to call the helper functions instead of inline bash
- Update Hard Rule 12 to reference the helper functions

Closes #2809
alex-solovyev added a commit that referenced this pull request Mar 4, 2026
… function (t1391) (#2810)

The LLM supervisor kept re-implementing the inline bash idempotency
guard incorrectly on each pulse cycle, causing 15+ duplicate
'external contributor' comments on PR #2792 despite 4 prior fix
attempts (PRs #2794, #2796, #2801, #2803) — all in pulse.md prompt
text.

Root cause: the check is deterministic (one correct answer regardless
of context) but was encoded as prompt guidance that the LLM had to
re-implement each cycle. Per the 'Intelligence Over Determinism'
principle, deterministic logic belongs in the harness, not the prompt.

Changes:
- Add check_external_contributor_pr() to pulse-wrapper.sh: checks
  BOTH label AND comment, captures exit codes separately, fails
  closed on any API error, only posts when confirmed safe
- Add check_permission_failure_pr() companion function for the
  permission API failure case
- Add source guard (BASH_SOURCE check) so pulse-wrapper.sh can be
  sourced for its functions without triggering the full lifecycle
- Update pulse.md to call the helper functions instead of inline bash
- Update Hard Rule 12 to reference the helper functions

Closes #2809
@alex-solovyev
Copy link
Collaborator

Pulse note (2026-03-04): This PR has accumulated duplicate 'external contributor' comments due to a bug in the pulse idempotency guard (tracked in issue #2812 — fix dispatched). The label external-contributor is set correctly. No further action needed from the pulse until the contributor addresses the review feedback (CodeRabbit CHANGES_REQUESTED + CI failures). Maintainer review required before merge.

@alex-solovyev
Copy link
Collaborator

Supervisor note: A worker process (PID 3159030) was dispatched to fix the CodeRabbit review feedback on this PR. After 3+ hours with 0 commits (struggle_ratio: 8240), the worker was killed — it was thrashing without making progress.

Additionally, this PR is from an external contributor (@johnwaldo, read permission). Per project policy, maintainers do not fix CI/review issues on external contributor PRs — that is the contributor's responsibility.

@johnwaldo: Please address the CodeRabbit CHANGES_REQUESTED feedback directly on your branch. The issues to fix are documented in the review comments above.

@marcusquinn
Copy link
Owner

those memory pressure thresholds might be overaggressive, as the os can continue to run fine with up to say 200% utilisation if compression and swap drive is used

@marcusquinn
Copy link
Owner

Closing this PR — thank you @johnwaldo for the contribution and the motivation (the March 3 kernel panic was real and serious).

After root-cause analysis, the actual memory problem is caused by aidevops's own processes, not general OS memory pressure:

  1. ShellCheck with --external-sources recursively expanding across 100+ scripts — single invocation observed at 5.7 GB RSS
  2. Zombie pulse processes persisting for 28+ hours when launchd stops re-invoking
  3. 10+ concurrent opencode sessions accumulating 2.5+ GB with no framework warning

The kern.memorystatus_level thresholds in this PR (40% warning, 20% critical) are too aggressive — macOS runs fine with compression and swap well beyond those levels. The right fix is to guard the framework's own resource consumption.

Replacement task created: #2854 (t1398) — covers process resource guards, shellcheck hardening, pulse self-watchdog, session awareness, and a rewritten memory monitor targeting process-level metrics instead of OS-level pressure.

@marcusquinn marcusquinn closed this Mar 5, 2026
marcusquinn added a commit that referenced this pull request Mar 5, 2026
… [skip ci]

Root-cause analysis of March 3 kernel panic identified three sources:
shellcheck exponential expansion (5+ GB), zombie pulse processes,
session accumulation (2.5+ GB). Closes PR #2792, creates GH#2854.
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
…398)

Root-cause fix for runaway memory consumption (March 3 kernel panic):

1. Process guard in pulse-wrapper.sh: guard_child_processes() runs every
   60s in the watchdog loop, killing any child process exceeding
   configurable RSS (2GB default) or runtime (10min default) limits.
   ShellCheck gets stricter limits (1GB/5min) due to known exponential
   expansion with --external-sources.

2. ShellCheck hardened: linters-local.sh now runs per-file with timeout
   (30s each) instead of batch mode. Prevents a single file with
   recursive source directives from consuming 5+ GB RAM. Quality sweep
   in pulse-wrapper.sh also gets per-file timeout.

3. Session count awareness: check_session_count() warns when >5
   concurrent interactive sessions are open. Count displayed in the
   health issue dashboard with threshold warning.

4. Standalone process-guard-helper.sh: scan, kill-runaways, sessions,
   and status commands for manual/cron use. Monitors the right signals
   (individual RSS, runtime, process count) — replaces PR #2792 concept.

Closes #2854
marcusquinn added a commit that referenced this pull request Mar 5, 2026
…mits, session awareness (#2855)

* fix: add process guards, ShellCheck limits, and session awareness (t1398)

Root-cause fix for runaway memory consumption (March 3 kernel panic):

1. Process guard in pulse-wrapper.sh: guard_child_processes() runs every
   60s in the watchdog loop, killing any child process exceeding
   configurable RSS (2GB default) or runtime (10min default) limits.
   ShellCheck gets stricter limits (1GB/5min) due to known exponential
   expansion with --external-sources.

2. ShellCheck hardened: linters-local.sh now runs per-file with timeout
   (30s each) instead of batch mode. Prevents a single file with
   recursive source directives from consuming 5+ GB RAM. Quality sweep
   in pulse-wrapper.sh also gets per-file timeout.

3. Session count awareness: check_session_count() warns when >5
   concurrent interactive sessions are open. Count displayed in the
   health issue dashboard with threshold warning.

4. Standalone process-guard-helper.sh: scan, kill-runaways, sessions,
   and status commands for manual/cron use. Monitors the right signals
   (individual RSS, runtime, process count) — replaces PR #2792 concept.

Closes #2854

* fix: address review findings — validate env vars, fix ps field, reduce subshells (t1398)

Security: add _validate_int() to process-guard-helper.sh to prevent command
injection via arithmetic expansion of env vars (CHILD_RSS_LIMIT_KB, etc.).
Mirrors the same pattern already used in pulse-wrapper.sh.

Bug: change ps 'comm' field to 'command' so grep pattern 'node.*opencode'
can match the full command line instead of just the executable name.

Efficiency: replace awk/echo subshell spawning in while loops with read
builtin for field parsing in both process-guard-helper.sh and pulse-wrapper.sh.

Style: remove 2>/dev/null from mkdir -p and shellcheck invocations — the
|| true is sufficient to prevent set -e exits while keeping errors visible.

* fix: address CodeRabbit review — portable timeout, TTY guard, env validation (t1398)

- linters-local.sh: portable background+watcher timeout for shellcheck
  fallback when neither timeout nor gtimeout is available
- process-guard-helper.sh: add tty field to ps, skip TTY-attached
  (interactive) processes from kill logic; _validate_int() sanitizes
  all numeric env vars before arithmetic expansion
- pulse-wrapper.sh: skip CHILD_RUNTIME_LIMIT for primary pulse PID
  (governed by PULSE_STALE_THRESHOLD); skip shellcheck entirely when
  no timeout utility available to prevent runaway expansion

* fix: correct session counting for Linux TTY and resolve SC2126 (t1398)

- Replace grep|wc-l session counting with awk that excludes both '?' (Linux
  headless) and '??' (macOS headless) TTYs — fixes misclassification of
  headless processes as interactive sessions on Linux
- Eliminates SC2126 (grep|wc-l) violations flagged by CodeFactor
- Applied consistently across all 4 session counting locations:
  process-guard-helper.sh (cmd_scan, cmd_sessions, cmd_status) and
  pulse-wrapper.sh (check_session_count)

Addresses CodeRabbit second review and CodeFactor CI failure.

* fix: replace grep|wc-l with grep -c in ShellCheck result counting (SC2126)

* fix: rename ppid_unused to _ppid to suppress SC2034 warning (t1398)

* fix: address CodeRabbit round-2 review — TTY filtering, stderr, subshell reduction (t1398)

- Fix session counting to filter both '?' (Linux) and '??' (macOS) headless
  TTY entries using awk instead of grep -v '??' (4 locations across
  process-guard-helper.sh and pulse-wrapper.sh check_session_count)
- Remove blanket 2>/dev/null on ps commands in process-guard while loops
  and pulse-wrapper guard_child_processes; keep || true for error handling
- Replace echo|tr|wc subshell pipeline with parameter expansion for
  colon counting in _get_process_age (zero subshell overhead)

* fix: address remaining CodeRabbit review — comm→command, echo|grep→<<<, remove 2>/dev/null (t1398)

pulse-wrapper.sh:
- guard_child_processes: use 'command' field instead of 'comm' for full
  command line matching (node.*opencode pattern now works)
- Extract cmd_base from full command path for limit selection
- Replace echo|grep -c and echo|head with <<<$result herestrings
  in quality sweep loop (avoids spawning subshells per iteration)
- Remove blanket 2>/dev/null from ps axo commands in
  prefetch_active_workers and _update_health_issue_for_repo
  (keep || true for pipeline failure tolerance)

---------

Co-authored-by: marcusquinn <6428977+marcusquinn@users.noreply.github.com>
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
Rewrite memory pressure monitor to monitor the right signals:
- Per-process RSS (warn 2GB, critical 4GB) instead of kern.memorystatus_level
- Process runtime limits (10min shellcheck, 30min other tools)
- Interactive session count (warn >= 5)
- Aggregate aidevops RSS (warn >= 8GB)

kern.memorystatus_level retained as secondary/informational signal only.
Fixes security issues from declined PR #2792 (command injection via
arithmetic eval, AppleScript injection, XML injection in plist).

Includes 28 unit tests covering validation, cooldown, CLI, and security.

Closes #2878
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
Rewrite memory pressure monitor to monitor the right signals:
- Per-process RSS (warn 2GB, critical 4GB) instead of kern.memorystatus_level
- Process runtime limits (10min shellcheck, 30min other tools)
- Interactive session count (warn >= 5)
- Aggregate aidevops RSS (warn >= 8GB)

kern.memorystatus_level retained as secondary/informational signal only.
Fixes security issues from declined PR #2792 (command injection via
arithmetic eval, AppleScript injection, XML injection in plist).

Includes 28 unit tests covering validation, cooldown, CLI, and security.

Closes #2878
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contributor PR from external contributor, requires maintainer review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants