feat: add memory pressure monitor with launchd integration#2792
feat: add memory pressure monitor with launchd integration#2792johnwaldo wants to merge 2 commits intomarcusquinn:mainfrom
Conversation
Early warning system for macOS memory pressure that monitors kern.memorystatus_level, swap state, and compressor segments. Sends desktop notifications at warning (<=40%) and critical (<=20%) thresholds with top memory consumers. Includes --install for launchd (30s interval), --daemon mode, and --status for quick checks. Motivated by a kernel panic caused by memory exhaustion cascade (46 swap files, 100% compressor segment limit) under heavy Electron + opencode workloads.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a critical system monitoring tool for macOS, aimed at preventing kernel panics caused by memory exhaustion. By proactively tracking memory pressure indicators and providing timely, detailed notifications, it empowers users to intervene before system stability is compromised. The integration with Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughAdds a new Bash agent script that monitors macOS memory pressure, swap and VM compressor state. Provides single-check, daemonized monitoring (LaunchAgent), and status modes; emits notifications (terminal-notifier or osascript) with per-level cooldowns, logging, install/uninstall of a LaunchAgent plist, and configurable thresholds. Changes
Sequence Diagram(s)sequenceDiagram
participant Launchd as Launchd Scheduler
participant Script as memory-pressure-monitor.sh
participant Kernel as macOS Kernel
participant Notifier as Notification Agent
participant Logger as Log File
Launchd->>Script: Run (daemon mode / schedule)
Script->>Kernel: Query kern.memorystatus_level, swap, compressor, VM stats
Kernel-->>Script: Return memory metrics
Script->>Script: Evaluate thresholds & cooldowns
alt Threshold exceeded & cooldown expired
Script->>Notifier: Send notification (terminal-notifier or osascript)
Notifier-->>Script: Delivery result
Script->>Script: Update cooldown timestamp
end
Script->>Logger: Append metrics and top consumers
Logger-->>Script: Confirm write
Script-->>Launchd: Exit/status
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.agents/scripts/memory-pressure-monitor.sh (1)
289-298: Daemon mode works but could benefit from signal handling for cleaner shutdown.The infinite loop is functional, but adding a trap for SIGTERM/SIGINT would improve observability by logging when the daemon stops. This is optional since launchd manages the lifecycle anyway.
✨ Optional: Add signal trap for graceful shutdown logging
cmd_daemon() { echo "[${SCRIPT_NAME}] Starting daemon mode (interval: ${DAEMON_INTERVAL}s)" echo "[${SCRIPT_NAME}] Thresholds: warn=${WARN_THRESHOLD}%, crit=${CRIT_THRESHOLD}%" echo "[${SCRIPT_NAME}] Press Ctrl+C to stop" + trap 'echo "[${SCRIPT_NAME}] Shutting down"; exit 0' SIGTERM SIGINT while true; do do_check sleep "${DAEMON_INTERVAL}" done }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/memory-pressure-monitor.sh around lines 289 - 298, Add a signal trap in cmd_daemon to handle SIGINT and SIGTERM so the daemon logs a graceful shutdown and exits cleanly: inside cmd_daemon (which currently contains the infinite while true loop calling do_check and sleeping DAEMON_INTERVAL), set trap handlers for SIGINT and SIGTERM that echo a shutdown message including SCRIPT_NAME and then exit, and ensure the trap is installed before entering the loop so the daemon logs when it stops.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 116-120: get_memory_level currently returns "0" on sysctl failure
which is indistinguishable from an actual critical reading; change
get_memory_level to capture the sysctl output and exit status, validate the
output is a numeric 0-100 value (e.g., via a regex or awk), and only echo the
numeric value when valid; on failure or invalid output echo an empty string (or
a sentinel like "unknown") instead of "0" and optionally emit a warning to
stderr so callers can treat non-numeric/empty as "unavailable" rather than
critical.
---
Nitpick comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 289-298: Add a signal trap in cmd_daemon to handle SIGINT and
SIGTERM so the daemon logs a graceful shutdown and exits cleanly: inside
cmd_daemon (which currently contains the infinite while true loop calling
do_check and sleeping DAEMON_INTERVAL), set trap handlers for SIGINT and SIGTERM
that echo a shutdown message including SCRIPT_NAME and then exit, and ensure the
trap is installed before entering the loop so the daemon logs when it stops.
| get_memory_level() { | ||
| # kern.memorystatus_level: 0-100, percentage of memory considered "free" | ||
| # by the kernel (includes purgeable, inactive, free pages) | ||
| sysctl -n kern.memorystatus_level 2>/dev/null || echo "0" | ||
| } |
There was a problem hiding this comment.
Fallback to "0" on sysctl failure could trigger false critical alerts.
If sysctl -n kern.memorystatus_level fails (permissions, kernel issue, etc.), returning "0" will be interpreted as critical memory pressure, causing spurious notifications. Consider differentiating between failure and actual low memory.
🛡️ Proposed fix to handle sysctl failure gracefully
get_memory_level() {
# kern.memorystatus_level: 0-100, percentage of memory considered "free"
# by the kernel (includes purgeable, inactive, free pages)
- sysctl -n kern.memorystatus_level 2>/dev/null || echo "0"
+ local level
+ if ! level="$(sysctl -n kern.memorystatus_level 2>/dev/null)" || [[ -z "${level}" ]]; then
+ log_msg "ERROR" "Failed to read kern.memorystatus_level"
+ echo "100" # Fail safe: assume normal to avoid false alerts
+ return 1
+ fi
+ echo "${level}"
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/memory-pressure-monitor.sh around lines 116 - 120,
get_memory_level currently returns "0" on sysctl failure which is
indistinguishable from an actual critical reading; change get_memory_level to
capture the sysctl output and exit status, validate the output is a numeric
0-100 value (e.g., via a regex or awk), and only echo the numeric value when
valid; on failure or invalid output echo an empty string (or a sentinel like
"unknown") instead of "0" and optionally emit a warning to stderr so callers can
treat non-numeric/empty as "unavailable" rather than critical.
Halves launchd process churn with no practical loss in detection time. The memory pressure cascade that caused the kernel panic took minutes to develop — 60s granularity is more than sufficient.
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
There was a problem hiding this comment.
Code Review
This pull request introduces a well-designed and valuable memory pressure monitor for macOS. However, it contains several security vulnerabilities, including command injection via bash arithmetic evaluation, XML injection in the generated launchd plist, and AppleScript injection in the notification function. Additionally, the script frequently suppresses command errors using 2>/dev/null, which hinders debugging and goes against project rules for shell scripting. These issues should be addressed by validating environment variables, properly escaping data, and ensuring errors are visible for diagnostics.
|
|
||
| # Fallback: osascript | ||
| if command -v osascript &>/dev/null; then | ||
| osascript -e "display notification \"${message}\" with title \"Memory Pressure: ${title}\"" 2>/dev/null || true |
There was a problem hiding this comment.
The notify function is vulnerable to AppleScript injection because message and title are passed to osascript without proper escaping, allowing arbitrary commands if they contain double quotes. The provided suggestion addresses the injection and removes the 2>/dev/null suppression, ensuring potential osascript errors are visible for troubleshooting, in line with our guidelines.
| osascript -e "display notification \"${message}\" with title \"Memory Pressure: ${title}\"" 2>/dev/null || true | |
| local escaped_message="${message//"/\"}" | |
| local escaped_title="${title//"/\"}" | |
| osascript -e "display notification \"${escaped_message}\" with title \"Memory Pressure: ${escaped_title}\"" || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| readonly WARN_THRESHOLD="${MEMORY_WARN_THRESHOLD:-40}" | ||
| readonly CRIT_THRESHOLD="${MEMORY_CRIT_THRESHOLD:-20}" | ||
| readonly COOLDOWN_SECS="${MEMORY_COOLDOWN_SECS:-300}" |
There was a problem hiding this comment.
The script uses environment variables (MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, MEMORY_COOLDOWN_SECS) directly in bash arithmetic evaluations (e.g., [[ ... -le ... ]] on lines 187 and 189, and $(( ... )) on line 101). Bash performs arithmetic expansion on these operands, which can lead to arbitrary command execution if the variables contain malicious expressions like a[$(id)]. It is recommended to validate that these variables contain only numeric values before assigning them to the readonly variables.
| readonly WARN_THRESHOLD="${MEMORY_WARN_THRESHOLD:-40}" | |
| readonly CRIT_THRESHOLD="${MEMORY_CRIT_THRESHOLD:-20}" | |
| readonly COOLDOWN_SECS="${MEMORY_COOLDOWN_SECS:-300}" | |
| readonly WARN_THRESHOLD=$(echo "${MEMORY_WARN_THRESHOLD:-40}" | grep -E '^[0-9]+$' || echo 40) | |
| readonly CRIT_THRESHOLD=$(echo "${MEMORY_CRIT_THRESHOLD:-20}" | grep -E '^[0-9]+$' || echo 20) | |
| readonly COOLDOWN_SECS=$(echo "${MEMORY_COOLDOWN_SECS:-300}" | grep -E '^[0-9]+$' || echo 300) |
| readonly NOTIFY_ENABLED="${MEMORY_NOTIFY:-true}" | ||
| readonly DAEMON_INTERVAL=60 | ||
|
|
||
| readonly LOG_DIR="${MEMORY_LOG_DIR:-${HOME}/.aidevops/logs}" |
There was a problem hiding this comment.
The LOG_DIR variable, derived from the MEMORY_LOG_DIR environment variable, is used to construct file paths and is embedded directly into a launchd plist XML template on line 325. An attacker could provide a value containing XML tags (e.g., </string><key>ProgramArguments</key>...) to alter the structure of the plist and execute arbitrary commands when the agent is started by launchd. It is recommended to sanitize the input to remove characters that could break the XML structure.
| readonly LOG_DIR="${MEMORY_LOG_DIR:-${HOME}/.aidevops/logs}" | |
| readonly LOG_DIR=$(echo "${MEMORY_LOG_DIR:-${HOME}/.aidevops/logs}" | tr -d '<>&') |
| -message "${message}" \ | ||
| -sound "${sound}" \ | ||
| -group "${SCRIPT_NAME}" \ | ||
| -sender "com.apple.ActivityMonitor" 2>/dev/null || true |
There was a problem hiding this comment.
The use of 2>/dev/null here suppresses all error output from terminal-notifier. While || true correctly prevents the script from exiting on failure, hiding the error message makes it difficult to diagnose why notifications might not be working (e.g., a misconfigured terminal-notifier, permission issues). In accordance with our general rule to avoid blanket error suppression, it's better to let error messages be visible for debugging.
| -sender "com.apple.ActivityMonitor" 2>/dev/null || true | |
| -sender "com.apple.ActivityMonitor" || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| local cooldown_file="$1" | ||
| if [[ -f "${cooldown_file}" ]]; then | ||
| local last_notify | ||
| last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)" |
There was a problem hiding this comment.
This use of 2>/dev/null violates our rule against suppressing errors on file operations when the file's existence has already been checked. The [[ -f ... ]] check on line 96 makes this redundant for "file not found" errors, and it can mask other important issues like file permission problems. The || echo 0 fallback is sufficient to handle failures gracefully.
| last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)" | |
| last_notify="$(cat "${cooldown_file}" || echo 0)" |
References
- Avoid using
2>/dev/nullto suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g.,[[ -f "$file" ]]or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.
| get_memory_level() { | ||
| # kern.memorystatus_level: 0-100, percentage of memory considered "free" | ||
| # by the kernel (includes purgeable, inactive, free pages) | ||
| sysctl -n kern.memorystatus_level 2>/dev/null || echo "0" |
There was a problem hiding this comment.
Suppressing stderr with 2>/dev/null here hides potentially important diagnostic information if sysctl fails (e.g., due to permissions or if the kernel key changes in a future macOS version). The || echo "0" construct already ensures the script continues with a default value, so the error suppression is not needed for script stability and is detrimental to debuggability, per our project guidelines. This same issue applies to lines 124, 131, and 139.
| sysctl -n kern.memorystatus_level 2>/dev/null || echo "0" | |
| sysctl -n kern.memorystatus_level || echo "0" |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
|
|
||
| get_swap_file_count() { | ||
| local count | ||
| count="$(find /private/var/vm -name 'swapfile*' 2>/dev/null | wc -l | tr -d ' ')" |
There was a problem hiding this comment.
Using 2>/dev/null with find can hide important errors, such as "Permission denied" when traversing /private/var/vm. As per our guidelines, such errors should remain visible for debugging. Since stderr is separate from stdout, error messages will not interfere with the wc -l pipe, so there is no need to suppress them.
| count="$(find /private/var/vm -name 'swapfile*' 2>/dev/null | wc -l | tr -d ' ')" | |
| count="$(find /private/var/vm -name 'swapfile*' | wc -l | tr -d ' ')" |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| # ps output: RSS (KB), %MEM, COMMAND — sorted by RSS descending | ||
| # Use process substitution to avoid SIGPIPE from head closing the pipe | ||
| local ps_output | ||
| ps_output="$(ps -eo rss=,pmem=,comm= -r 2>/dev/null | head -n "${count}" || true)" |
There was a problem hiding this comment.
Blanket error suppression with 2>/dev/null on the ps command should be avoided. If ps fails, it's important to see the error to understand why the script isn't reporting top memory consumers. The || true at the end of the pipeline already prevents the script from exiting due to set -e, so removing the redirection is safe and improves debuggability.
| ps_output="$(ps -eo rss=,pmem=,comm= -r 2>/dev/null | head -n "${count}" || true)" | |
| ps_output="$(ps -eo rss=,pmem=,comm= -r | head -n "${count}" || true)" |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| </plist> | ||
| PLIST | ||
|
|
||
| launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true |
There was a problem hiding this comment.
The 2>/dev/null here suppresses all errors from launchctl bootout. While it's intended to ignore the "service not found" error when the agent isn't installed, it will also hide other potentially critical errors (e.g., permission issues talking to launchd). The || true is sufficient to prevent the script from exiting. Letting errors be visible makes the script more robust and easier to debug. A similar issue exists on line 360.
| launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true | |
| launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
.agents/scripts/memory-pressure-monitor.sh (1)
116-120:⚠️ Potential issue | 🟠 MajorAvoid mapping telemetry read failures to a critical memory value
Line 119 returning
"0"onsysctlfailure is indistinguishable from genuine critical pressure and can trigger false critical alerts. Return an explicit “unavailable” sentinel and skip severity evaluation for that cycle.Suggested fix
get_memory_level() { # kern.memorystatus_level: 0-100, percentage of memory considered "free" # by the kernel (includes purgeable, inactive, free pages) - sysctl -n kern.memorystatus_level 2>/dev/null || echo "0" + local level + if ! level="$(sysctl -n kern.memorystatus_level 2>/dev/null)"; then + log_msg "WARN" "Unable to read kern.memorystatus_level" + echo "" + return 1 + fi + if [[ ! "${level}" =~ ^[0-9]+$ ]] || (( level < 0 || level > 100 )); then + log_msg "WARN" "Invalid kern.memorystatus_level value: ${level}" + echo "" + return 1 + fi + echo "${level}" }# also guard in do_check before evaluate_state level="$(echo "${state}" | cut -d'|' -f1)" if [[ -z "${level}" ]]; then log_msg "WARN" "Memory level unavailable; skipping this check cycle" return 0 fiAs per coding guidelines,
.agents/scripts/*.sh: Automation scripts should emphasize reliability/robustness and error recovery mechanisms.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/memory-pressure-monitor.sh around lines 116 - 120, The get_memory_level function should not return "0" on sysctl failure because that mimics critical memory pressure; change get_memory_level to return an explicit empty/unavailable sentinel (e.g., empty string) on failure instead of "0", and then update the caller (do_check before calling evaluate_state) to detect an empty/unavailable level (e.g., level="$(echo "${state}" | cut -d'|' -f1)" and if [[ -z "${level}" ]]; then log a WARN like "Memory level unavailable; skipping this check cycle" and return 0) so severity evaluation is skipped for that cycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 300-351: The installation currently writes the LaunchAgent plist
at ${PLIST_PATH} in cmd_install without ensuring its parent directory exists;
update cmd_install to create the LaunchAgents directory (mkdir -p "$(dirname
"${PLIST_PATH}")") before redirecting the heredoc, and ensure proper
permissions/ownership if required. Alternatively, add the LaunchAgents path
creation to ensure_dirs (alongside LOG_DIR and STATE_DIR) so the directory
exists for cmd_install; reference cmd_install, PLIST_PATH, ensure_dirs, LOG_DIR
and STATE_DIR when making the change.
- Around line 94-103: The check_cooldown function reads a timestamp from the
cooldown file into last_notify and then performs arithmetic with now and
COOLDOWN_SECS without validating last_notify; add numeric validation and a safe
fallback so corrupted/non-numeric file contents won't break the script.
Specifically, after reading last_notify in check_cooldown, validate it is an
integer (e.g., with a shell regex like '^[0-9]+$' or a POSIX-safe test) and if
invalid or empty set last_notify=0 before computing elapsed, then continue the
existing cooldown comparison using COOLDOWN_SECS; keep variable names
last_notify, now, elapsed and the return semantics unchanged.
- Around line 31-35: Add a new validate_config() function and call it at the
start of main() to ensure MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, and
MEMORY_COOLDOWN_SECS are valid numbers: parse WARN_THRESHOLD and CRIT_THRESHOLD
as integers, ensure they are numeric and within 0–100, enforce WARN_THRESHOLD >
CRIT_THRESHOLD, and validate COOLDOWN_SECS is a non-negative integer; on any
failure, emit a clear error via SCRIPT_NAME and exit non-zero. Update the
existing numeric comparisons that use WARN_THRESHOLD, CRIT_THRESHOLD, and
COOLDOWN_SECS by relying on the validated/normalized integer variables produced
by validate_config() so string values won’t be implicitly coerced. Ensure
validate_config() handles missing env vars by falling back to the existing
defaults, and logs precise error messages before exiting to aid debugging.
---
Duplicate comments:
In @.agents/scripts/memory-pressure-monitor.sh:
- Around line 116-120: The get_memory_level function should not return "0" on
sysctl failure because that mimics critical memory pressure; change
get_memory_level to return an explicit empty/unavailable sentinel (e.g., empty
string) on failure instead of "0", and then update the caller (do_check before
calling evaluate_state) to detect an empty/unavailable level (e.g.,
level="$(echo "${state}" | cut -d'|' -f1)" and if [[ -z "${level}" ]]; then log
a WARN like "Memory level unavailable; skipping this check cycle" and return 0)
so severity evaluation is skipped for that cycle.
| readonly SCRIPT_NAME="memory-pressure-monitor" | ||
| readonly WARN_THRESHOLD="${MEMORY_WARN_THRESHOLD:-40}" | ||
| readonly CRIT_THRESHOLD="${MEMORY_CRIT_THRESHOLD:-20}" | ||
| readonly COOLDOWN_SECS="${MEMORY_COOLDOWN_SECS:-300}" | ||
| readonly NOTIFY_ENABLED="${MEMORY_NOTIFY:-true}" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "memory-pressure-monitor.sh" -type fRepository: marcusquinn/aidevops
Length of output: 108
🏁 Script executed:
cat -n .agents/scripts/memory-pressure-monitor.shRepository: marcusquinn/aidevops
Length of output: 15243
🏁 Script executed:
#!/bin/bash
# First check if shellcheck is available, install if not
if ! command -v shellcheck &>/dev/null; then
echo "Installing shellcheck..."
apt-get update && apt-get install -y shellcheck 2>/dev/null || brew install shellcheck
fi
# Run ShellCheck
shellcheck -S warning .agents/scripts/memory-pressure-monitor.sh || trueRepository: marcusquinn/aidevops
Length of output: 46
🏁 Script executed:
#!/bin/bash
# Test what happens with non-numeric threshold values
echo "=== Testing non-numeric threshold behavior ==="
# Simulate the script's behavior with invalid threshold
WARN_THRESHOLD="abc"
CRIT_THRESHOLD="20"
COOLDOWN_SECS="300"
level=35
echo "Test 1: Non-numeric WARN_THRESHOLD='abc', level=35"
if [[ ${level} -le ${CRIT_THRESHOLD} ]]; then
echo " Result: critical"
elif [[ ${level} -le ${WARN_THRESHOLD} ]]; then
echo " Result: warning"
else
echo " Result: normal"
fi 2>&1 || echo " ERROR: Arithmetic comparison failed"
echo ""
echo "Test 2: Inverted thresholds (WARN=20, CRIT=40, level=35)"
WARN_THRESHOLD=20
CRIT_THRESHOLD=40
if [[ ${level} -le ${CRIT_THRESHOLD} ]]; then
echo " Result: critical (WRONG!)"
elif [[ ${level} -le ${WARN_THRESHOLD} ]]; then
echo " Result: warning"
else
echo " Result: normal"
fi
echo ""
echo "Test 3: Non-numeric COOLDOWN_SECS behavior"
COOLDOWN_SECS="not_a_number"
elapsed=100
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
echo " Result: Still in cooldown"
else
echo " Result: Cooldown expired"
fi 2>&1 || echo " ERROR: Arithmetic comparison failed"Repository: marcusquinn/aidevops
Length of output: 335
🏁 Script executed:
#!/bin/bash
# Now test the suggested validation function logic
echo "=== Testing suggested validation function ==="
validate_config() {
for v in WARN_THRESHOLD CRIT_THRESHOLD COOLDOWN_SECS; do
if [[ ! "${!v}" =~ ^[0-9]+$ ]]; then
echo "Invalid ${v}: ${!v} (must be numeric)" >&2
return 1
fi
done
if (( WARN_THRESHOLD <= CRIT_THRESHOLD )); then
echo "Invalid thresholds: warning (${WARN_THRESHOLD}) must be greater than critical (${CRIT_THRESHOLD})" >&2
return 1
fi
if (( WARN_THRESHOLD > 100 || CRIT_THRESHOLD < 0 )); then
echo "Thresholds must be within 0-100" >&2
return 1
fi
}
# Test valid config
echo "Test 1: Valid config (WARN=40, CRIT=20, COOLDOWN=300)"
WARN_THRESHOLD=40
CRIT_THRESHOLD=20
COOLDOWN_SECS=300
validate_config && echo " ✓ Valid" || echo " ✗ Invalid"
echo ""
echo "Test 2: Invalid - non-numeric WARN_THRESHOLD"
WARN_THRESHOLD=abc
CRIT_THRESHOLD=20
COOLDOWN_SECS=300
validate_config && echo " ✓ Valid" || echo " ✗ Invalid"
echo ""
echo "Test 3: Invalid - inverted thresholds (WARN=20, CRIT=40)"
WARN_THRESHOLD=20
CRIT_THRESHOLD=40
COOLDOWN_SECS=300
validate_config && echo " ✓ Valid" || echo " ✗ Invalid"
echo ""
echo "Test 4: Invalid - out of range (WARN=150, CRIT=20)"
WARN_THRESHOLD=150
CRIT_THRESHOLD=20
COOLDOWN_SECS=300
validate_config && echo " ✓ Valid" || echo " ✗ Invalid"
echo ""
echo "Test 5: Invalid - non-numeric COOLDOWN_SECS"
WARN_THRESHOLD=40
CRIT_THRESHOLD=20
COOLDOWN_SECS=notanumber
validate_config && echo " ✓ Valid" || echo " ✗ Invalid"Repository: marcusquinn/aidevops
Length of output: 637
🏁 Script executed:
#!/bin/bash
# Verify the exact behavior: bash arithmetic and error handling with non-numeric values
echo "=== Understanding bash -le behavior with non-numeric strings ==="
echo "Test: bash -le with non-numeric string (no error exit)"
bash -c '
set -euo pipefail # Script uses this
val="abc"
if [[ 35 -le $val ]]; then
echo "True"
else
echo "False"
fi
echo "Script continued (no error)"
'
echo ""
echo "Test: bash arithmetic with non-numeric in (( )) expression"
bash -c '
set -euo pipefail
val="abc"
if (( 35 < val )); then
echo "True"
else
echo "False (or coerced to 0)"
fi
' 2>&1 || echo "Script failed with exit code $?"Repository: marcusquinn/aidevops
Length of output: 363
🏁 Script executed:
#!/bin/bash
# Final test: Verify actual silent failure behavior in the script's context
echo "=== Simulating actual script behavior with invalid inputs ==="
# Simulate evaluate_state with non-numeric threshold
evaluate_state() {
local state="$1"
local level total_gb swap_total swap_used compressor_segments swap_files
IFS='|' read -r level total_gb swap_total swap_used compressor_segments swap_files <<<"${state}"
if [[ ${level} -le ${CRIT_THRESHOLD} ]]; then
echo "critical"
elif [[ ${level} -le ${WARN_THRESHOLD} ]]; then
echo "warning"
else
echo "normal"
fi
}
# Test with non-numeric WARN_THRESHOLD
echo "Scenario 1: Non-numeric WARN_THRESHOLD (should warn at level=35)"
WARN_THRESHOLD="xyz"
CRIT_THRESHOLD=20
state="35|16.0|0|0|0|0"
result=$(evaluate_state "${state}")
echo " Input: level=35, WARN_THRESHOLD='xyz', CRIT_THRESHOLD=20"
echo " Result: ${result}"
echo " Expected: warning (level between thresholds)"
echo " ISSUE: Silent failure - returns ${result} (wrong!)"
echo ""
echo "Scenario 2: Inverted thresholds (WARN=20, CRIT=40, level=35)"
WARN_THRESHOLD=20
CRIT_THRESHOLD=40
result=$(evaluate_state "${state}")
echo " Input: level=35, WARN_THRESHOLD=20, CRIT_THRESHOLD=40"
echo " Result: ${result}"
echo " Expected: warning (but thresholds are backwards)"
echo " ISSUE: Returns critical when it should be warning!"
echo ""
echo "Scenario 3: Non-numeric COOLDOWN_SECS in cooldown check"
COOLDOWN_SECS="bad"
elapsed=100
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
echo " Result: Still in cooldown"
else
echo " Result: Cooldown expired (WRONG - should check numeric validity)"
fiRepository: marcusquinn/aidevops
Length of output: 728
Add validation for threshold and cooldown configuration values
Environment variables MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, and MEMORY_COOLDOWN_SECS are consumed in numeric comparisons (lines 102, 187, 189) without validation. Non-numeric values silently coerce to incorrect results—for example, WARN_THRESHOLD="abc" causes level 35 to be misclassified as "normal" instead of "warning". Additionally, inverted thresholds (warning ≤ critical) are not caught, causing false critical alerts.
Add a validate_config() function that validates numeric format, enforces WARN_THRESHOLD > CRIT_THRESHOLD, and checks that both fall within 0–100. Call this function at the start of main() before any monitoring logic executes.
Suggested validation function
+validate_config() {
+ for v in WARN_THRESHOLD CRIT_THRESHOLD COOLDOWN_SECS; do
+ if [[ ! "${!v}" =~ ^[0-9]+$ ]]; then
+ echo "Invalid ${v}: ${!v} (must be numeric)" >&2
+ return 1
+ fi
+ done
+ if (( WARN_THRESHOLD <= CRIT_THRESHOLD )); then
+ echo "Invalid thresholds: warning (${WARN_THRESHOLD}) must be greater than critical (${CRIT_THRESHOLD})" >&2
+ return 1
+ fi
+ if (( WARN_THRESHOLD > 100 || CRIT_THRESHOLD < 0 )); then
+ echo "Thresholds must be within 0-100" >&2
+ return 1
+ fi
+} main() {
local cmd="${1:-check}"
+ validate_config || return 1Per automation script guidelines: ensure reliability, robustness, and proper error handling with clear feedback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/memory-pressure-monitor.sh around lines 31 - 35, Add a new
validate_config() function and call it at the start of main() to ensure
MEMORY_WARN_THRESHOLD, MEMORY_CRIT_THRESHOLD, and MEMORY_COOLDOWN_SECS are valid
numbers: parse WARN_THRESHOLD and CRIT_THRESHOLD as integers, ensure they are
numeric and within 0–100, enforce WARN_THRESHOLD > CRIT_THRESHOLD, and validate
COOLDOWN_SECS is a non-negative integer; on any failure, emit a clear error via
SCRIPT_NAME and exit non-zero. Update the existing numeric comparisons that use
WARN_THRESHOLD, CRIT_THRESHOLD, and COOLDOWN_SECS by relying on the
validated/normalized integer variables produced by validate_config() so string
values won’t be implicitly coerced. Ensure validate_config() handles missing env
vars by falling back to the existing defaults, and logs precise error messages
before exiting to aid debugging.
| check_cooldown() { | ||
| local cooldown_file="$1" | ||
| if [[ -f "${cooldown_file}" ]]; then | ||
| local last_notify | ||
| last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)" | ||
| local now | ||
| now="$(date +%s)" | ||
| local elapsed=$((now - last_notify)) | ||
| if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then | ||
| return 1 # Still in cooldown |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
head -120 .agents/scripts/memory-pressure-monitor.sh | tail -40Repository: marcusquinn/aidevops
Length of output: 1092
🏁 Script executed:
grep -n "set -" .agents/scripts/memory-pressure-monitor.shRepository: marcusquinn/aidevops
Length of output: 84
🏁 Script executed:
cat -n .agents/scripts/memory-pressure-monitor.sh | sed -n '90,110p'Repository: marcusquinn/aidevops
Length of output: 638
🏁 Script executed:
# Check for shellcheck availability and run it on the script
if command -v shellcheck &>/dev/null; then
shellcheck -S warning .agents/scripts/memory-pressure-monitor.sh 2>&1 || true
else
echo "ShellCheck not available in sandbox"
fiRepository: marcusquinn/aidevops
Length of output: 46
🏁 Script executed:
# Look for how cooldown file content is typically initialized
rg "cooldown_file" .agents/scripts/memory-pressure-monitor.sh -B 2 -A 2Repository: marcusquinn/aidevops
Length of output: 343
🏁 Script executed:
# Check for numeric validation patterns elsewhere in the script
rg '\[0-9\]' .agents/scripts/memory-pressure-monitor.sh -B 1 -A 1Repository: marcusquinn/aidevops
Length of output: 46
🏁 Script executed:
# Look for error handling patterns
rg "set -euo pipefail|trap|return 1" .agents/scripts/memory-pressure-monitor.sh | head -20Repository: marcusquinn/aidevops
Length of output: 176
Validate cooldown timestamps before arithmetic to prevent script failure
The cooldown file content is not validated before use in arithmetic expansion on line 101. If the file becomes corrupted, the $((now - last_notify)) operation will fail under set -euo pipefail, causing script termination. The current fallback to echo 0 only handles read failures, not content validity.
Add numeric validation with graceful recovery:
Suggested fix
check_cooldown() {
local cooldown_file="$1"
if [[ -f "${cooldown_file}" ]]; then
local last_notify
last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)"
+ if [[ ! "${last_notify}" =~ ^[0-9]+$ ]]; then
+ log_msg "WARN" "Invalid cooldown value in ${cooldown_file}; resetting"
+ rm -f "${cooldown_file}"
+ return 0
+ fi
local now
now="$(date +%s)"
local elapsed=$((now - last_notify))
if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then
return 1 # Still in cooldownPer automation script guidelines: maintain reliability/robustness and robust error recovery mechanisms.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| check_cooldown() { | |
| local cooldown_file="$1" | |
| if [[ -f "${cooldown_file}" ]]; then | |
| local last_notify | |
| last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)" | |
| local now | |
| now="$(date +%s)" | |
| local elapsed=$((now - last_notify)) | |
| if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then | |
| return 1 # Still in cooldown | |
| check_cooldown() { | |
| local cooldown_file="$1" | |
| if [[ -f "${cooldown_file}" ]]; then | |
| local last_notify | |
| last_notify="$(cat "${cooldown_file}" 2>/dev/null || echo 0)" | |
| if [[ ! "${last_notify}" =~ ^[0-9]+$ ]]; then | |
| log_msg "WARN" "Invalid cooldown value in ${cooldown_file}; resetting" | |
| rm -f "${cooldown_file}" | |
| return 0 | |
| fi | |
| local now | |
| now="$(date +%s)" | |
| local elapsed=$((now - last_notify)) | |
| if [[ ${elapsed} -lt ${COOLDOWN_SECS} ]]; then | |
| return 1 # Still in cooldown |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/memory-pressure-monitor.sh around lines 94 - 103, The
check_cooldown function reads a timestamp from the cooldown file into
last_notify and then performs arithmetic with now and COOLDOWN_SECS without
validating last_notify; add numeric validation and a safe fallback so
corrupted/non-numeric file contents won't break the script. Specifically, after
reading last_notify in check_cooldown, validate it is an integer (e.g., with a
shell regex like '^[0-9]+$' or a POSIX-safe test) and if invalid or empty set
last_notify=0 before computing elapsed, then continue the existing cooldown
comparison using COOLDOWN_SECS; keep variable names last_notify, now, elapsed
and the return semantics unchanged.
| cmd_install() { | ||
| local script_path | ||
| script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")" | ||
|
|
||
| # Verify the script exists at the expected installed location | ||
| local installed_path="${HOME}/.aidevops/agents/scripts/${SCRIPT_NAME}.sh" | ||
| if [[ -x "${installed_path}" ]]; then | ||
| script_path="${installed_path}" | ||
| fi | ||
|
|
||
| cat >"${PLIST_PATH}" <<PLIST | ||
| <?xml version="1.0" encoding="UTF-8"?> | ||
| <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> | ||
| <plist version="1.0"> | ||
| <dict> | ||
| <key>Label</key> | ||
| <string>${LAUNCHD_LABEL}</string> | ||
| <key>ProgramArguments</key> | ||
| <array> | ||
| <string>/bin/bash</string> | ||
| <string>${script_path}</string> | ||
| </array> | ||
| <key>StartInterval</key> | ||
| <integer>60</integer> | ||
| <key>StandardOutPath</key> | ||
| <string>${LOG_DIR}/memory-pressure-launchd.log</string> | ||
| <key>StandardErrorPath</key> | ||
| <string>${LOG_DIR}/memory-pressure-launchd.log</string> | ||
| <key>EnvironmentVariables</key> | ||
| <dict> | ||
| <key>PATH</key> | ||
| <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string> | ||
| <key>HOME</key> | ||
| <string>${HOME}</string> | ||
| </dict> | ||
| <key>RunAtLoad</key> | ||
| <true/> | ||
| <key>KeepAlive</key> | ||
| <false/> | ||
| <key>ProcessType</key> | ||
| <string>Background</string> | ||
| <key>LowPriorityBackgroundIO</key> | ||
| <true/> | ||
| <key>Nice</key> | ||
| <integer>10</integer> | ||
| </dict> | ||
| </plist> | ||
| PLIST | ||
|
|
||
| launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true | ||
| launchctl bootstrap "gui/$(id -u)" "${PLIST_PATH}" | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file size and get an overview
wc -l .agents/scripts/memory-pressure-monitor.shRepository: marcusquinn/aidevops
Length of output: 110
🏁 Script executed:
# Read the entire script to understand context
cat -n .agents/scripts/memory-pressure-monitor.shRepository: marcusquinn/aidevops
Length of output: 15243
🏁 Script executed:
# Run ShellCheck on the script
shellcheck .agents/scripts/memory-pressure-monitor.sh 2>&1 || trueRepository: marcusquinn/aidevops
Length of output: 46
Add directory creation to cmd_install before writing the LaunchAgent plist
The cmd_install function writes the plist file to ${PLIST_PATH} without ensuring the parent directory ~/Library/LaunchAgents/ exists. This causes bootstrap failures on fresh environments. Additionally, ensure_dirs() creates only ${LOG_DIR} and ${STATE_DIR}, but not the LaunchAgent directory.
Required fix
cmd_install() {
+ ensure_dirs
+ mkdir -p "$(dirname "${PLIST_PATH}")"
+
local script_path
script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cmd_install() { | |
| local script_path | |
| script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")" | |
| # Verify the script exists at the expected installed location | |
| local installed_path="${HOME}/.aidevops/agents/scripts/${SCRIPT_NAME}.sh" | |
| if [[ -x "${installed_path}" ]]; then | |
| script_path="${installed_path}" | |
| fi | |
| cat >"${PLIST_PATH}" <<PLIST | |
| <?xml version="1.0" encoding="UTF-8"?> | |
| <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> | |
| <plist version="1.0"> | |
| <dict> | |
| <key>Label</key> | |
| <string>${LAUNCHD_LABEL}</string> | |
| <key>ProgramArguments</key> | |
| <array> | |
| <string>/bin/bash</string> | |
| <string>${script_path}</string> | |
| </array> | |
| <key>StartInterval</key> | |
| <integer>60</integer> | |
| <key>StandardOutPath</key> | |
| <string>${LOG_DIR}/memory-pressure-launchd.log</string> | |
| <key>StandardErrorPath</key> | |
| <string>${LOG_DIR}/memory-pressure-launchd.log</string> | |
| <key>EnvironmentVariables</key> | |
| <dict> | |
| <key>PATH</key> | |
| <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string> | |
| <key>HOME</key> | |
| <string>${HOME}</string> | |
| </dict> | |
| <key>RunAtLoad</key> | |
| <true/> | |
| <key>KeepAlive</key> | |
| <false/> | |
| <key>ProcessType</key> | |
| <string>Background</string> | |
| <key>LowPriorityBackgroundIO</key> | |
| <true/> | |
| <key>Nice</key> | |
| <integer>10</integer> | |
| </dict> | |
| </plist> | |
| PLIST | |
| launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true | |
| launchctl bootstrap "gui/$(id -u)" "${PLIST_PATH}" | |
| cmd_install() { | |
| ensure_dirs | |
| mkdir -p "$(dirname "${PLIST_PATH}")" | |
| local script_path | |
| script_path="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")" | |
| # Verify the script exists at the expected installed location | |
| local installed_path="${HOME}/.aidevops/agents/scripts/${SCRIPT_NAME}.sh" | |
| if [[ -x "${installed_path}" ]]; then | |
| script_path="${installed_path}" | |
| fi | |
| cat >"${PLIST_PATH}" <<PLIST | |
| <?xml version="1.0" encoding="UTF-8"?> | |
| <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> | |
| <plist version="1.0"> | |
| <dict> | |
| <key>Label</key> | |
| <string>${LAUNCHD_LABEL}</string> | |
| <key>ProgramArguments</key> | |
| <array> | |
| <string>/bin/bash</string> | |
| <string>${script_path}</string> | |
| </array> | |
| <key>StartInterval</key> | |
| <integer>60</integer> | |
| <key>StandardOutPath</key> | |
| <string>${LOG_DIR}/memory-pressure-launchd.log</string> | |
| <key>StandardErrorPath</key> | |
| <string>${LOG_DIR}/memory-pressure-launchd.log</string> | |
| <key>EnvironmentVariables</key> | |
| <dict> | |
| <key>PATH</key> | |
| <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string> | |
| <key>HOME</key> | |
| <string>${HOME}</string> | |
| </dict> | |
| <key>RunAtLoad</key> | |
| <true/> | |
| <key>KeepAlive</key> | |
| <false/> | |
| <key>ProcessType</key> | |
| <string>Background</string> | |
| <key>LowPriorityBackgroundIO</key> | |
| <true/> | |
| <key>Nice</key> | |
| <integer>10</integer> | |
| </dict> | |
| </plist> | |
| PLIST | |
| launchctl bootout "gui/$(id -u)" "${PLIST_PATH}" 2>/dev/null || true | |
| launchctl bootstrap "gui/$(id -u)" "${PLIST_PATH}" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/memory-pressure-monitor.sh around lines 300 - 351, The
installation currently writes the LaunchAgent plist at ${PLIST_PATH} in
cmd_install without ensuring its parent directory exists; update cmd_install to
create the LaunchAgents directory (mkdir -p "$(dirname "${PLIST_PATH}")") before
redirecting the heredoc, and ensure proper permissions/ownership if required.
Alternatively, add the LaunchAgents path creation to ensure_dirs (alongside
LOG_DIR and STATE_DIR) so the directory exists for cmd_install; reference
cmd_install, PLIST_PATH, ensure_dirs, LOG_DIR and STATE_DIR when making the
change.
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
8 similar comments
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. CI status: FAIL (SonarCloud analysis failing, auto-fix check failing) Key issues from CodeRabbit review:
These should be addressed by the contributor before this can be merged. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. Review findings summary:
The core feature (memory pressure monitoring) is valuable and well-motivated. The security issues flagged by Gemini need addressing before merge. Maintainer review required. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
3 similar comments
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. Review findings summary:
The feature itself is valuable (motivated by a real kernel panic). The issues above need to be addressed by the contributor before this can be merged. |
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
gh pr edit --add-label fails with GraphQL Projects deprecation warning (exit 1), preventing the external-contributor label from being applied. This broke the idempotency guard, causing duplicate comments on every pulse cycle (17+ on PR #2792). Replace both instances (external-contributor gate and orphaned PR scanner) with gh api repos/<slug>/issues/<number>/labels REST endpoint which works correctly. Closes #2793
…sed on API errors Root cause: the previous guard (#2796) used 2>&1 to capture stderr into the output variable, but this meant API failures produced non-empty strings that didn't match the grep patterns, causing the if/elif/else chain to fall through to the 'post comment' branch on every pulse cycle (~2 min). PR #2792 accumulated 20+ identical comments from this bug. Three structural changes: 1. Separate exit code from output: capture $? independently so API failures are detected by exit code, not by parsing error text 2. Fail closed: if either the label check or comment check API call fails (non-zero exit), skip posting entirely — the next pulse retries 3. Independent checks: check label AND comment as separate boolean flags, then use a 3-way decision (API error → skip, already flagged → skip/re-label, clean → post). The old if/elif/else structure conflated 'API error' with 'not found'. Same fix applied to the permission-failure comment guard (outcome 3). Closes #2802
Root cause: the comment idempotency check used a pipe (gh pr view | grep) which silently returned 'no match' when the API call failed, causing the guard to fall through and post a duplicate comment every 2-minute pulse cycle. This produced 18 duplicate comments on PR #2792. Three fixes applied: 1. Fetch comments into a variable (like labels) instead of piping — so we can capture the exit code and detect API failures 2. Fail closed: if the comment fetch fails, skip posting (assume already flagged). The next pulse cycle retries when the API recovers 3. Replace 2>/dev/null with 2>&1 on label-add operations so failures are visible in logs Same pattern applied to the permission-failure comment guard. Closes #2802
…sed on API errors Root cause: the previous guard (#2796) used 2>&1 to capture stderr into the output variable, but this meant API failures produced non-empty strings that didn't match the grep patterns, causing the if/elif/else chain to fall through to the 'post comment' branch on every pulse cycle (~2 min). PR #2792 accumulated 20+ identical comments from this bug. Three structural changes: 1. Separate exit code from output: capture $? independently so API failures are detected by exit code, not by parsing error text 2. Fail closed: if either the label check or comment check API call fails (non-zero exit), skip posting entirely — the next pulse retries 3. Independent checks: check label AND comment as separate boolean flags, then use a 3-way decision (API error → skip, already flagged → skip/re-label, clean → post). The old if/elif/else structure conflated 'API error' with 'not found'. Same fix applied to the permission-failure comment guard (outcome 3). Closes #2802
…sed on API errors (#2803) Root cause: the previous guard (#2796) used 2>&1 to capture stderr into the output variable, but this meant API failures produced non-empty strings that didn't match the grep patterns, causing the if/elif/else chain to fall through to the 'post comment' branch on every pulse cycle (~2 min). PR #2792 accumulated 20+ identical comments from this bug. Three structural changes: 1. Separate exit code from output: capture $? independently so API failures are detected by exit code, not by parsing error text 2. Fail closed: if either the label check or comment check API call fails (non-zero exit), skip posting entirely — the next pulse retries 3. Independent checks: check label AND comment as separate boolean flags, then use a 3-way decision (API error → skip, already flagged → skip/re-label, clean → post). The old if/elif/else structure conflated 'API error' with 'not found'. Same fix applied to the permission-failure comment guard (outcome 3). Closes #2802
… function (t1391) The LLM supervisor kept re-implementing the inline bash idempotency guard incorrectly on each pulse cycle, causing 15+ duplicate 'external contributor' comments on PR #2792 despite 4 prior fix attempts (PRs #2794, #2796, #2801, #2803) — all in pulse.md prompt text. Root cause: the check is deterministic (one correct answer regardless of context) but was encoded as prompt guidance that the LLM had to re-implement each cycle. Per the 'Intelligence Over Determinism' principle, deterministic logic belongs in the harness, not the prompt. Changes: - Add check_external_contributor_pr() to pulse-wrapper.sh: checks BOTH label AND comment, captures exit codes separately, fails closed on any API error, only posts when confirmed safe - Add check_permission_failure_pr() companion function for the permission API failure case - Add source guard (BASH_SOURCE check) so pulse-wrapper.sh can be sourced for its functions without triggering the full lifecycle - Update pulse.md to call the helper functions instead of inline bash - Update Hard Rule 12 to reference the helper functions Closes #2809
… function (t1391) (#2810) The LLM supervisor kept re-implementing the inline bash idempotency guard incorrectly on each pulse cycle, causing 15+ duplicate 'external contributor' comments on PR #2792 despite 4 prior fix attempts (PRs #2794, #2796, #2801, #2803) — all in pulse.md prompt text. Root cause: the check is deterministic (one correct answer regardless of context) but was encoded as prompt guidance that the LLM had to re-implement each cycle. Per the 'Intelligence Over Determinism' principle, deterministic logic belongs in the harness, not the prompt. Changes: - Add check_external_contributor_pr() to pulse-wrapper.sh: checks BOTH label AND comment, captures exit codes separately, fails closed on any API error, only posts when confirmed safe - Add check_permission_failure_pr() companion function for the permission API failure case - Add source guard (BASH_SOURCE check) so pulse-wrapper.sh can be sourced for its functions without triggering the full lifecycle - Update pulse.md to call the helper functions instead of inline bash - Update Hard Rule 12 to reference the helper functions Closes #2809
|
Pulse note (2026-03-04): This PR has accumulated duplicate 'external contributor' comments due to a bug in the pulse idempotency guard (tracked in issue #2812 — fix dispatched). The label |
|
Supervisor note: A worker process (PID 3159030) was dispatched to fix the CodeRabbit review feedback on this PR. After 3+ hours with 0 commits (struggle_ratio: 8240), the worker was killed — it was thrashing without making progress. Additionally, this PR is from an external contributor (@johnwaldo, read permission). Per project policy, maintainers do not fix CI/review issues on external contributor PRs — that is the contributor's responsibility. @johnwaldo: Please address the CodeRabbit CHANGES_REQUESTED feedback directly on your branch. The issues to fix are documented in the review comments above. |
|
those memory pressure thresholds might be overaggressive, as the os can continue to run fine with up to say 200% utilisation if compression and swap drive is used |
|
Closing this PR — thank you @johnwaldo for the contribution and the motivation (the March 3 kernel panic was real and serious). After root-cause analysis, the actual memory problem is caused by aidevops's own processes, not general OS memory pressure:
The Replacement task created: #2854 (t1398) — covers process resource guards, shellcheck hardening, pulse self-watchdog, session awareness, and a rewritten memory monitor targeting process-level metrics instead of OS-level pressure. |
… [skip ci] Root-cause analysis of March 3 kernel panic identified three sources: shellcheck exponential expansion (5+ GB), zombie pulse processes, session accumulation (2.5+ GB). Closes PR #2792, creates GH#2854.
…398) Root-cause fix for runaway memory consumption (March 3 kernel panic): 1. Process guard in pulse-wrapper.sh: guard_child_processes() runs every 60s in the watchdog loop, killing any child process exceeding configurable RSS (2GB default) or runtime (10min default) limits. ShellCheck gets stricter limits (1GB/5min) due to known exponential expansion with --external-sources. 2. ShellCheck hardened: linters-local.sh now runs per-file with timeout (30s each) instead of batch mode. Prevents a single file with recursive source directives from consuming 5+ GB RAM. Quality sweep in pulse-wrapper.sh also gets per-file timeout. 3. Session count awareness: check_session_count() warns when >5 concurrent interactive sessions are open. Count displayed in the health issue dashboard with threshold warning. 4. Standalone process-guard-helper.sh: scan, kill-runaways, sessions, and status commands for manual/cron use. Monitors the right signals (individual RSS, runtime, process count) — replaces PR #2792 concept. Closes #2854
…mits, session awareness (#2855) * fix: add process guards, ShellCheck limits, and session awareness (t1398) Root-cause fix for runaway memory consumption (March 3 kernel panic): 1. Process guard in pulse-wrapper.sh: guard_child_processes() runs every 60s in the watchdog loop, killing any child process exceeding configurable RSS (2GB default) or runtime (10min default) limits. ShellCheck gets stricter limits (1GB/5min) due to known exponential expansion with --external-sources. 2. ShellCheck hardened: linters-local.sh now runs per-file with timeout (30s each) instead of batch mode. Prevents a single file with recursive source directives from consuming 5+ GB RAM. Quality sweep in pulse-wrapper.sh also gets per-file timeout. 3. Session count awareness: check_session_count() warns when >5 concurrent interactive sessions are open. Count displayed in the health issue dashboard with threshold warning. 4. Standalone process-guard-helper.sh: scan, kill-runaways, sessions, and status commands for manual/cron use. Monitors the right signals (individual RSS, runtime, process count) — replaces PR #2792 concept. Closes #2854 * fix: address review findings — validate env vars, fix ps field, reduce subshells (t1398) Security: add _validate_int() to process-guard-helper.sh to prevent command injection via arithmetic expansion of env vars (CHILD_RSS_LIMIT_KB, etc.). Mirrors the same pattern already used in pulse-wrapper.sh. Bug: change ps 'comm' field to 'command' so grep pattern 'node.*opencode' can match the full command line instead of just the executable name. Efficiency: replace awk/echo subshell spawning in while loops with read builtin for field parsing in both process-guard-helper.sh and pulse-wrapper.sh. Style: remove 2>/dev/null from mkdir -p and shellcheck invocations — the || true is sufficient to prevent set -e exits while keeping errors visible. * fix: address CodeRabbit review — portable timeout, TTY guard, env validation (t1398) - linters-local.sh: portable background+watcher timeout for shellcheck fallback when neither timeout nor gtimeout is available - process-guard-helper.sh: add tty field to ps, skip TTY-attached (interactive) processes from kill logic; _validate_int() sanitizes all numeric env vars before arithmetic expansion - pulse-wrapper.sh: skip CHILD_RUNTIME_LIMIT for primary pulse PID (governed by PULSE_STALE_THRESHOLD); skip shellcheck entirely when no timeout utility available to prevent runaway expansion * fix: correct session counting for Linux TTY and resolve SC2126 (t1398) - Replace grep|wc-l session counting with awk that excludes both '?' (Linux headless) and '??' (macOS headless) TTYs — fixes misclassification of headless processes as interactive sessions on Linux - Eliminates SC2126 (grep|wc-l) violations flagged by CodeFactor - Applied consistently across all 4 session counting locations: process-guard-helper.sh (cmd_scan, cmd_sessions, cmd_status) and pulse-wrapper.sh (check_session_count) Addresses CodeRabbit second review and CodeFactor CI failure. * fix: replace grep|wc-l with grep -c in ShellCheck result counting (SC2126) * fix: rename ppid_unused to _ppid to suppress SC2034 warning (t1398) * fix: address CodeRabbit round-2 review — TTY filtering, stderr, subshell reduction (t1398) - Fix session counting to filter both '?' (Linux) and '??' (macOS) headless TTY entries using awk instead of grep -v '??' (4 locations across process-guard-helper.sh and pulse-wrapper.sh check_session_count) - Remove blanket 2>/dev/null on ps commands in process-guard while loops and pulse-wrapper guard_child_processes; keep || true for error handling - Replace echo|tr|wc subshell pipeline with parameter expansion for colon counting in _get_process_age (zero subshell overhead) * fix: address remaining CodeRabbit review — comm→command, echo|grep→<<<, remove 2>/dev/null (t1398) pulse-wrapper.sh: - guard_child_processes: use 'command' field instead of 'comm' for full command line matching (node.*opencode pattern now works) - Extract cmd_base from full command path for limit selection - Replace echo|grep -c and echo|head with <<<$result herestrings in quality sweep loop (avoids spawning subshells per iteration) - Remove blanket 2>/dev/null from ps axo commands in prefetch_active_workers and _update_health_issue_for_repo (keep || true for pipeline failure tolerance) --------- Co-authored-by: marcusquinn <6428977+marcusquinn@users.noreply.github.com>
Rewrite memory pressure monitor to monitor the right signals: - Per-process RSS (warn 2GB, critical 4GB) instead of kern.memorystatus_level - Process runtime limits (10min shellcheck, 30min other tools) - Interactive session count (warn >= 5) - Aggregate aidevops RSS (warn >= 8GB) kern.memorystatus_level retained as secondary/informational signal only. Fixes security issues from declined PR #2792 (command injection via arithmetic eval, AppleScript injection, XML injection in plist). Includes 28 unit tests covering validation, cooldown, CLI, and security. Closes #2878
Rewrite memory pressure monitor to monitor the right signals: - Per-process RSS (warn 2GB, critical 4GB) instead of kern.memorystatus_level - Process runtime limits (10min shellcheck, 30min other tools) - Interactive session count (warn >= 5) - Aggregate aidevops RSS (warn >= 8GB) kern.memorystatus_level retained as secondary/informational signal only. Fixes security issues from declined PR #2792 (command injection via arithmetic eval, AppleScript injection, XML injection in plist). Includes 28 unit tests covering validation, cooldown, CLI, and security. Closes #2878
Summary
memory-pressure-monitor.sh— early warning system for macOS memory pressurekern.memorystatus_level, swap file count, and compressor segmentsterminal-notifierorosascript) at warning (<=40%) and critical (<=20%) thresholdsMotivation
Kernel panic on March 3 caused by memory exhaustion cascade: 46 swap files, 100% compressor segment limit, watchdog timeout after 94 seconds. The system had no early warning — by the time pressure was visible, it was too late.
Features
--check--status--daemon--installsh.aidevops.memory-pressure-monitor)--uninstallDesign
kern.memorystatus_level(kernel-maintained, instant read, 0-100)ProcessType: Background,Nice: 10,LowPriorityBackgroundIO: truein launchd plistterminal-notifierTesting
--status,--check(normal path),--check(triggered warning/critical via threshold override),--helpSummary by CodeRabbit