fix: prevent ShellCheck memory explosion with RSS watchdog, rate limiting, and PATH fix#3105
fix: prevent ShellCheck memory explosion with RSS watchdog, rate limiting, and PATH fix#3105johnwaldo wants to merge 4 commits intomarcusquinn:mainfrom
Conversation
…ting, and PATH fix Three defenses against the Mar 7 crash (18.5 GB shellcheck, 31 GB total): shellcheck-wrapper.sh: - Replace broken ulimit -v (no-op on macOS ARM — EINVAL) with background RSS watchdog that polls every 2s and kills at 1 GB RSS - Add respawn rate limiter with exponential backoff (5s, 10s, 20s... 300s) to prevent kill-respawn-grow cycles from bash-language-server - Add 120s hard timeout as additional safety net memory-pressure-monitor.sh (v2.1.0): - Lower thresholds: warn at 1 GB, kill at 2 GB (was 2/4 GB) - Lower shellcheck runtime: 5 min (was 10 min) - Fix zsh false positives: match MONITORED_PATTERNS against command basename only, not full command line with arguments - Adaptive polling: 10s when shellcheck detected, 30s normal (was 60s) - Launchd plist interval: 30s (was 60s) setup-modules/shell-env.sh: - Fix PATH ordering: strip existing shim_dir from PATH before prepending to guarantee ~/.aidevops/bin is FIRST, not appended at end - Document ulimit -v macOS ARM limitation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughThis PR implements multi-layered resource guards for ShellCheck to prevent memory pressure incidents. Changes include adaptive process monitoring with dynamic polling intervals, RSS-based process termination with exponential backoff rate limiting, and robust PATH sanitization to ensure consistent wrapper invocation. Configuration thresholds are lowered across per-process limits and runtime constraints. Changes
Sequence DiagramsequenceDiagram
participant Client as Invoker
participant Wrapper as ShellCheck Wrapper
participant RateLimit as Rate Limiter
participant Watchdog as RSS Watchdog
participant ShellCheck as Real ShellCheck
participant Monitor as Memory Monitor
Client->>Wrapper: shellcheck [args]
Wrapper->>RateLimit: _check_rate_limit()
alt Rate limit active
RateLimit-->>Wrapper: return empty JSON
Wrapper-->>Client: no-diagnostics response
else Rate limit clear
RateLimit-->>Wrapper: proceed
Wrapper->>Wrapper: filter args (remove --external-sources)
Wrapper->>Wrapper: apply ulimit -v constraint
Wrapper->>ShellCheck: spawn background process
Wrapper->>Watchdog: _start_watchdog(pid)
par
ShellCheck-->>ShellCheck: run analysis
Watchdog->>Monitor: poll RSS periodically
Monitor-->>Watchdog: current memory usage
alt RSS > LIMIT or timeout elapsed
Watchdog->>ShellCheck: kill process
Watchdog->>RateLimit: _record_kill()
end
end
Wrapper->>ShellCheck: wait for exit
Wrapper->>Watchdog: terminate monitor
ShellCheck-->>Wrapper: exit status & diagnostics
Wrapper-->>Client: return ShellCheck output
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical system stability issues caused by ShellCheck memory explosions, particularly on macOS ARM where Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a robust, multi-layered defense against the ShellCheck memory explosion issue, incorporating an RSS watchdog, respawn rate limiting, and a hard timeout in shellcheck-wrapper.sh. However, a critical Path Injection vulnerability was identified in setup-modules/shell-env.sh, where the PATH manipulation for launchctl can introduce a trailing colon, posing a significant security risk. Additionally, several instances of 2>/dev/null were identified that suppress potentially important errors, which has been addressed. A race condition in _record_kill was also noted. While the adaptive polling in memory-pressure-monitor.sh and the PATH ordering fix in setup-modules/shell-env.sh are positive improvements, further suggestions are provided to enhance the overall efficiency and robustness of the new shell scripts.
| # Record that a kill happened (called by the watchdog) | ||
| _record_kill() { | ||
| mkdir -p "$BACKOFF_DIR" 2>/dev/null || true | ||
|
|
||
| local kill_count=0 | ||
| if [[ -f "$BACKOFF_FILE" ]]; then | ||
| read -r kill_count _ <"$BACKOFF_FILE" 2>/dev/null || kill_count=0 | ||
| [[ "$kill_count" =~ ^[0-9]+$ ]] || kill_count=0 | ||
| fi | ||
|
|
||
| kill_count=$((kill_count + 1)) | ||
| printf '%s %s\n' "$kill_count" "$(date +%s)" >"$BACKOFF_FILE" | ||
| } |
There was a problem hiding this comment.
The _record_kill function has a race condition. If multiple shellcheck processes are killed concurrently, their wrapper instances could call this function at the same time. The read-modify-write sequence on $BACKOFF_FILE is not atomic, which could lead to an incorrect kill_count and a shorter backoff period than intended, partially defeating the rate-limiting mechanism.
To fix this, you can use a lock to ensure the file update is atomic. A common and portable way to do this in shell is using mkdir as a lock.
| # Record that a kill happened (called by the watchdog) | |
| _record_kill() { | |
| mkdir -p "$BACKOFF_DIR" 2>/dev/null || true | |
| local kill_count=0 | |
| if [[ -f "$BACKOFF_FILE" ]]; then | |
| read -r kill_count _ <"$BACKOFF_FILE" 2>/dev/null || kill_count=0 | |
| [[ "$kill_count" =~ ^[0-9]+$ ]] || kill_count=0 | |
| fi | |
| kill_count=$((kill_count + 1)) | |
| printf '%s %s\n' "$kill_count" "$(date +%s)" >"$BACKOFF_FILE" | |
| } | |
| _record_kill() { | |
| mkdir -p "$BACKOFF_DIR" || true | |
| local lock_dir="${BACKOFF_DIR}/shellcheck.lock" | |
| # Use mkdir for an atomic lock. If it fails, another process holds the lock. | |
| if mkdir "$lock_dir" 2>/dev/null; then | |
| # Ensure lock is removed on function exit | |
| trap 'rmdir "$lock_dir" 2>/dev/null' RETURN | |
| local kill_count=0 | |
| if [[ -f "$BACKOFF_FILE" ]]; then | |
| read -r kill_count _ <"$BACKOFF_FILE" 2>/dev/null || kill_count=0 | |
| [[ "$kill_count" =~ ^[0-9]+$ ]] || kill_count=0 | |
| fi | |
| kill_count=$((kill_count + 1)) | |
| printf '%s %s\n' "$kill_count" "$(date +%s)" >"$BACKOFF_FILE" | |
| fi | |
| } |
setup-modules/shell-env.sh
Outdated
| clean_path=$(printf '%s' "$PATH" | tr ':' '\n' | grep -v "^${shim_dir}$" | tr '\n' ':' | sed 's/:$//') | ||
| if launchctl setenv PATH "${shim_dir}:${clean_path}" 2>/dev/null; then |
There was a problem hiding this comment.
The logic used to clean and prepend the shim directory to the PATH for launchctl setenv can introduce a trailing colon or a double colon. In shell PATH resolution, an empty entry (represented by a leading colon, a trailing colon, or two consecutive colons) is equivalent to the current directory (.). This is a security risk as it allows an attacker to place a malicious binary in a directory and have it executed if the user runs a command that is not found elsewhere in the PATH. If clean_path is empty (e.g., if PATH was empty or only contained shim_dir), the resulting PATH will be ${shim_dir}:. Additionally, suppressing stderr with 2>/dev/null can mask important errors.
| clean_path=$(printf '%s' "$PATH" | tr ':' '\n' | grep -v "^${shim_dir}$" | tr '\n' ':' | sed 's/:$//') | |
| if launchctl setenv PATH "${shim_dir}:${clean_path}" 2>/dev/null; then | |
| local clean_path | |
| clean_path=$(printf '%s' "$PATH" | tr ':' '\n' | grep -v "^${shim_dir}$" | tr '\n' ':' | sed 's/:$//') | |
| if [[ -n "$clean_path" ]]; then | |
| if launchctl setenv PATH "${shim_dir}:${clean_path}"; then | |
| print_info "Prepended $shim_dir to PATH via launchctl (GUI processes)" | |
| fi | |
| else | |
| if launchctl setenv PATH "${shim_dir}"; then | |
| print_info "Set PATH to $shim_dir via launchctl (GUI processes)" | |
| fi | |
| fi |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| # 1st kill: 5s, 2nd: 10s, 3rd: 20s, ... up to MAX_BACKOFF (300s). | ||
| # Resets after MAX_BACKOFF seconds of no kills. | ||
| _check_rate_limit() { | ||
| mkdir -p "$BACKOFF_DIR" 2>/dev/null || true |
There was a problem hiding this comment.
Using 2>/dev/null with mkdir -p suppresses potentially important errors, such as 'Permission denied'. If the backoff directory cannot be created, the rate-limiting feature will fail silently. This practice is also discouraged by the project's general rules.
Consider removing 2>/dev/null to make such errors visible for debugging. The || true will still prevent the script from exiting due to set -e.
This also applies to line 154 in _record_kill.
| mkdir -p "$BACKOFF_DIR" 2>/dev/null || true | |
| mkdir -p "$BACKOFF_DIR" || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
|
This PR is from an external contributor (@johnwaldo). Auto-merge is disabled for external PRs — a maintainer must review and merge manually. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/shellcheck-wrapper.sh:
- Around line 33-35: The three environment tunables RSS_LIMIT_MB,
WATCHDOG_INTERVAL, and HARD_TIMEOUT are used directly in arithmetic, sleep and
numeric comparisons and must be validated; add a small validation routine (e.g.,
validate_positive_int) that checks each of these variables (readonly
RSS_LIMIT_MB, WATCHDOG_INTERVAL, HARD_TIMEOUT) against a strict integer/positive
regex or integer parse, coerce or replace non-numeric/zero values with safe
defaults (1024, 2, 120 respectively) and log a warning, and clamp excessively
small values (e.g., WATCHDOG_INTERVAL >= 1) before any use in $(( )), sleep or
-gt comparisons to avoid tight loops or premature aborts; call this validation
early (immediately after the readonly assignments) so all later watchdog logic
uses the sanitized values.
In `@setup-modules/shell-env.sh`:
- Around line 559-572: The PATH sanitization you added for macOS launchctl
doesn't update existing interactive/login shells; update the rc-file layer(s)
that currently add a presence-only guard (e.g., the code that sets PATH in
.zshenv/.bashrc/.profile and the fish config) to reuse the same
sanitize-and-prepend logic used with shim_dir/wrapper_path: compute a clean_path
by splitting $PATH, removing any exact occurrences of shim_dir (or
"$HOME/.aidevops/bin"), then export PATH as shim_dir followed by the cleaned
rest (preserving the empty-case), and for fish use string match -v to build PATH
similarly; ensure you reference the same shim_dir variable name and keep the
launchctl setenv SHELLCHECK_PATH/wrapper_path logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3d21c565-374d-454c-88ff-097a91fdec51
📒 Files selected for processing (3)
.agents/scripts/memory-pressure-monitor.sh.agents/scripts/shellcheck-wrapper.shsetup-modules/shell-env.sh
… fix PATH security Address CHANGES_REQUESTED review from CodeRabbit and Gemini Code Assist: shellcheck-wrapper.sh: - Add _validate_int() to sanitize RSS_LIMIT_MB, WATCHDOG_INTERVAL, HARD_TIMEOUT before use in arithmetic/sleep (prevents tight loops from typos like 'abc') - Fix race condition in _record_kill() with mkdir-based atomic lock - Remove blanket 2>/dev/null from mkdir -p (surface permission errors) - Add explicit return 0 to _record_kill() setup-modules/shell-env.sh: - Fix PATH trailing colon security issue in launchctl setenv (empty clean_path produced 'shim_dir:' which resolves '.' — PATH injection vector) - Replace presence-only case guard with sanitize-and-prepend logic that strips existing shim_dir from PATH before prepending (fixes upgrade path where shim was appended at end, bypassing wrapper in new shells) - Clean up stale old-form case-guard entries from .zshenv and rc files - Update fish shell PATH line to use same sanitize-and-prepend pattern CI failures (SonarCloud, Label PR, Monitor) are all fork permission issues (403 'Resource not accessible by integration') — expected for external PRs.
|
Worker killed after 6h43m with 0 commits (struggle_ratio: 17907). Thrashing detected — re-dispatching with fresh context. The worker was unable to make progress on the CI failures and review feedback. |
|
Worker killed after 9+ hours with struggle_ratio 9310 (18621 messages, 2 commits). Severe thrashing — worker was looping on review feedback without making progress. Re-dispatching with fresh context. |
|
@coderabbitai review All review feedback has been addressed in commits 55fff8e, 495dc96, and cd9837c:
ShellCheck: zero violations on all 3 files. |
|
Triggering a fresh review of the latest commits now. 🧠 Learnings used✅ Actions performedReview triggered.
|
|
Closing in favor of a new PR from the rebased branch on origin. The fork branch was based on an older main, causing stale version/planning file diffs. All review feedback from CodeRabbit and Gemini Code Assist has been addressed. The new PR will have a clean diff with only the 3 relevant files. |
Summary
Fixes the recurring ShellCheck memory explosion that caused a system crash on Mar 7 (20 GB → 48 GB+ in minutes, 18.5 GB single shellcheck process).
Root cause analysis:
ulimit -vis a complete no-op on macOS ARM (setrlimit failed: invalid argument) — the wrapper's memory cap was doing nothingThree-layer fix:
1. shellcheck-wrapper.sh — RSS watchdog + rate limiter
ulimit -v: polls child process RSS every 2s, kills at 1 GB{"comments":[]}instead of running when in backoffulimit -vdependency)2. memory-pressure-monitor.sh v2.1.0 — faster detection
zsh -l -c opencode)3. setup-modules/shell-env.sh — PATH ordering fix
~/.aidevops/binfrom PATH before prepending, guaranteeing first positionTesting
launchctl getenv PATH— shim dir is firstCrash timeline (from logs)
With this fix, the wrapper would have killed each process at 1 GB (not 4 GB), and the rate limiter would have prevented the respawn cycle entirely.
Summary by CodeRabbit
New Features
Bug Fixes
Chores