
t1113: Add worker_never_started diagnostic and auto-retry with environment check #1980

Merged
marcusquinn merged 2 commits into main from feature/t1113
Feb 19, 2026

Conversation


@marcusquinn marcusquinn commented Feb 19, 2026

Summary

Prevents wasting retries on environment issues where the AI CLI is broken. On Feb 13, 5 tasks failed with worker_never_started:no_sentinel — the CLI was invoked but never produced output. Each failure burned a retry, exhausting max_retries on an infrastructure problem that no task retry would fix.

Changes

1. Pre-dispatch CLI health check (dispatch.sh)

  • New check_cli_health() function verifies the AI CLI binary exists and can execute before spawning workers
  • Checks: command -v for PATH presence, then lightweight version/help command with 10s timeout
  • Caches result per-pulse and for 5 minutes (file-based) to avoid repeated probes
  • On failure: defers dispatch to next pulse (return 3) instead of proceeding — no worktree created, no retry burned
  • Added cli_health=ok to dispatch metadata log for auditability
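A minimal sketch of what this pre-dispatch guard can look like, assuming only the behaviour described above (PATH check, 10-second version probe, 5-minute file cache); the cache path and log lines are illustrative, not the actual dispatch.sh code:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a pre-dispatch CLI health check. Cache-file naming
# and logging are assumptions based on the PR summary, not the real script.
check_cli_health() {
	local ai_cli="$1"
	local cache_file="${TMPDIR:-/tmp}/cli_health_${ai_cli}.cache"

	# 5-minute file-based cache: skip the probe if a recent OK is on record
	if [[ -f "$cache_file" ]]; then
		local cached_at now
		cached_at=$(cat "$cache_file" 2>/dev/null)
		now=$(date +%s)
		if (( now - ${cached_at:-0} < 300 )); then
			return 0
		fi
	fi

	# Check 1: binary is on PATH at all
	command -v "$ai_cli" >/dev/null 2>&1 || return 1

	# Check 2: lightweight version probe, bounded by a 10s timeout when available
	local out rc=0
	if command -v timeout >/dev/null 2>&1; then
		out=$(timeout 10 "$ai_cli" --version 2>&1)
		rc=$?
	else
		out=$("$ai_cli" --version 2>&1)
		rc=$?
	fi

	if [[ "$rc" -eq 0 ]]; then
		date +%s >"$cache_file"
		echo "cli_health=ok ($ai_cli: ${out:0:80})" >&2
		return 0
	fi
	echo "cli_health=fail ($ai_cli: rc=$rc, ${out:0:80})" >&2
	return 1
}
```

A caller like cmd_dispatch would then map a non-zero return to its own defer code (return 3) instead of creating a worktree.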

2. ENVIRONMENT failure mode (evaluate.sh)

  • New failure category ENVIRONMENT for dispatch infrastructure failures
  • Reclassified worker_never_started*, log_file_missing*, log_file_empty, no_log_path_in_db*, dispatch_script_not_executable from LOGIC to ENVIRONMENT
  • These failures indicate the CLI/environment is broken, not the task itself
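As a sketch, the reclassification amounts to routing these dispatch-infrastructure patterns to a new branch of the classifier. The pattern list comes from the PR description; the other categories shown are assumed examples, and the real evaluate.sh case table is longer:

```shell
# Hypothetical sketch of the classifier's ENVIRONMENT branch. Patterns are
# from the PR description; TRANSIENT/LOGIC entries are assumed examples.
classify_failure_mode() {
	local reason="$1"
	case "$reason" in
		worker_never_started*|log_file_missing*|log_file_empty|no_log_path_in_db*|dispatch_script_not_executable)
			echo "ENVIRONMENT" ;;  # infrastructure is broken, not the task
		timeout*|rate_limited*)
			echo "TRANSIENT" ;;    # assumed example of an existing category
		*)
			echo "LOGIC" ;;
	esac
}
```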

3. Re-queue without burning retries (pulse.sh)

  • When pulse detects an ENVIRONMENT failure, re-queues the task (evaluating -> queued) without incrementing retry count
  • Invalidates CLI health cache so next dispatch re-verifies the environment
  • Stores failure pattern for diagnostics without marking as permanent task failure
  • Proof-log entry records failure_mode=ENVIRONMENT,retry_preserved=true
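The steps above can be sketched as a single handler; the db helpers (task_set_status, task_increment_retry), cache glob, and proof-log format here are assumptions for illustration, not the real pulse.sh API:

```shell
# Hypothetical sketch of the pulse-side ENVIRONMENT failure handler.
# task_set_status / task_increment_retry are assumed helper names.
handle_failed_task() {
	local task_id="$1" reason="$2"

	# Inline classification of the dispatch-infrastructure patterns listed above
	local mode="OTHER"
	case "$reason" in
		worker_never_started*|log_file_missing*|log_file_empty|no_log_path_in_db*|dispatch_script_not_executable)
			mode="ENVIRONMENT" ;;
	esac

	if [[ "$mode" == "ENVIRONMENT" ]]; then
		# evaluating -> queued, retry counter untouched
		task_set_status "$task_id" "queued"
		# Invalidate the CLI health cache so the next dispatch re-verifies
		rm -f "${TMPDIR:-/tmp}"/cli_health_*.cache
		unset _PULSE_CLI_VERIFIED
		echo "failure_mode=ENVIRONMENT,retry_preserved=true,task=$task_id" >>"${PROOF_LOG:-/dev/null}"
	else
		# Normal path: this failure counts against the task's retries
		task_increment_retry "$task_id"
		task_set_status "$task_id" "queued"
	fi
}
```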

Verification

  • All 3 modified .sh files pass ShellCheck with zero violations
  • evaluating:queued is an existing valid state transition
  • failed:queued is also valid (used by existing recovery paths)

Ref #1664

Summary by CodeRabbit

Release Notes

  • New Features

    • Added CLI health verification before task dispatch to ensure required tools are available before processing begins.
  • Improvements

    • Introduced environment failure classification to prevent wasted retries on infrastructure issues.
    • Enhanced failure handling to intelligently defer environment-related failures without consuming retry attempts.

…ed failures (t1113)

Before dispatch, verify the AI CLI binary exists and can execute. If the
check fails, defer dispatch to next pulse instead of spawning a worker
that will never produce output. This prevents burning retries on
environment issues (missing binary, broken installation, PATH problems).

The check_cli_health() function:
- Verifies binary exists in PATH (command -v)
- Runs lightweight version check with 10s timeout
- Caches result per-pulse and for 5 minutes (file-based)
- Logs diagnostic detail on failure for troubleshooting

Also adds cli_health=ok to dispatch metadata log for auditability.
…without burning retries (t1113)

Reclassify worker_never_started, log_file_missing, log_file_empty, and
related dispatch infrastructure failures from LOGIC to new ENVIRONMENT
category. These failures indicate the CLI/environment is broken, not the
task itself.

When the pulse detects an ENVIRONMENT failure:
- Re-queues the task (evaluating -> queued) without incrementing retry count
- Invalidates CLI health cache so next dispatch re-verifies the environment
- Stores failure pattern for diagnostics without marking as task failure
- Logs proof-log entry with failure_mode=ENVIRONMENT for auditability

This prevents the Feb 13 scenario where 5 tasks exhausted their retries
on worker_never_started:no_sentinel when the underlying issue was a
broken CLI environment that no amount of task retries would fix.
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the robustness of the task dispatch system by intelligently handling environment-related failures. It introduces proactive checks for AI CLI health and a new failure classification, ensuring that tasks encountering infrastructure problems are re-queued without consuming valuable retries, thereby improving system efficiency and preventing tasks from being prematurely marked as failed due to external factors.

Highlights

  • Pre-dispatch CLI Health Check: A new check_cli_health() function was added to dispatch.sh to verify the AI CLI binary's existence and executability before spawning workers. This check includes PATH verification and a lightweight version/help command with a timeout, caching results to prevent redundant checks. If the CLI is unhealthy, dispatch is deferred, preventing worktree creation and retry burning.
  • New ENVIRONMENT Failure Mode: A new ENVIRONMENT failure category was introduced in evaluate.sh for dispatch infrastructure failures. Several existing failure patterns like worker_never_started and log_file_missing were reclassified from LOGIC to ENVIRONMENT, indicating issues with the environment rather than the task logic itself.
  • Smart Re-queue for Environment Failures: The pulse.sh script was updated to detect ENVIRONMENT failures. When such a failure occurs, the task is re-queued (evaluating -> queued) without incrementing its retry count, preserving retries for actual task-related issues. The CLI health cache is also invalidated to force a re-verification of the environment on the next dispatch attempt.
Changelog
  • .agents/scripts/supervisor/dispatch.sh
    • Added a new check_cli_health function to verify the AI CLI binary's presence and executability, including caching mechanisms.
    • Integrated the check_cli_health function into the cmd_dispatch logic, deferring dispatch if the CLI is unhealthy.
    • Included cli_health=ok in the dispatch metadata log for successful CLI health checks.
  • .agents/scripts/supervisor/evaluate.sh
    • Introduced a new ENVIRONMENT failure category to classify infrastructure-related dispatch issues.
    • Reclassified specific failure patterns (e.g., worker_never_started, log_file_missing) from the LOGIC category to the new ENVIRONMENT category.
  • .agents/scripts/supervisor/pulse.sh
    • Modified the failed task handling to classify failures using the new classify_failure_mode function.
    • Implemented conditional logic to re-queue tasks with ENVIRONMENT failures back to queued status without incrementing their retry count.
    • Added logic to invalidate the CLI health cache when an ENVIRONMENT failure occurs, ensuring a fresh check on the next dispatch.
Activity
  • The pull request was opened by marcusquinn.
  • No human activity (comments, reviews) has occurred yet.

@coderabbitai

coderabbitai bot commented Feb 19, 2026

Walkthrough

Added CLI health verification function with caching to validate AI CLI binary availability before dispatch. Introduced ENVIRONMENT failure category to classify infrastructure/dispatch issues separately from logic errors. Integrated environment-aware failure handling in pulse Phase 1 that requeues without consuming retries, invalidates health cache, and performs pre-dispatch verification in dispatch flow.

Changes

Cohort / File(s) Summary
CLI Health Check Infrastructure
.agents/scripts/supervisor/dispatch.sh
Introduced check_cli_health() function verifying CLI binary existence, executability, and version output usability. Includes pulse-level fast path and 5-minute file-based cache. Integrated into cmd_dispatch before worktree creation with deferral (return code 3) on failure. Added cli_health=ok metadata to dispatch logs.
Environment Failure Classification
.agents/scripts/supervisor/evaluate.sh
Added new ENVIRONMENT category for infrastructure/dispatch issues alongside existing TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS categories. Reclassified patterns (worker_never_started*, log_file_missing*, log_file_empty, no_log_path_in_db*, dispatch_script_not_executable) from LOGIC/TRANSIENT to ENVIRONMENT with deferred retry behavior.
Environment-Aware Failure Handling
.agents/scripts/supervisor/pulse.sh
Modified Phase 1 failure handling to classify failures via classify_failure_mode. ENVIRONMENT failures requeue without incrementing retry counter, write environment_failure proof-log, invalidate CLI health cache, and reset health-verify flag. Non-environment failures preserve original logic.

Sequence Diagram

sequenceDiagram
    participant Pulse as Pulse Supervisor
    participant Dispatch as Dispatch Flow
    participant Evaluate as Evaluate (Classify)
    participant Health as CLI Health Check
    participant Worker as Worker Process
    participant Cache as Health Cache

    Pulse->>Dispatch: Phase 1 - Check for work
    Dispatch->>Health: check_cli_health()
    Health->>Cache: Check 5-min cache
    alt Cache Valid
        Health->>Health: Return cached result
    else Cache Expired
        Health->>Health: Verify CLI binary exists
        Health->>Health: Verify executable & version
        Health->>Cache: Store result
    end
    Health-->>Dispatch: Health status (0=ok, 1=fail)
    
    alt CLI Healthy
        Dispatch->>Worker: Spawn worker process
        Dispatch-->>Pulse: Return 0
    else CLI Unhealthy
        Dispatch-->>Pulse: Return 3 (defer)
    end

    alt Evaluation Phase
        Pulse->>Evaluate: classify_failure_mode()
        Evaluate-->>Pulse: ENVIRONMENT or other
        alt Is ENVIRONMENT
            Pulse->>Pulse: Requeue (no retry++)
            Pulse->>Cache: Invalidate health cache
            Pulse->>Worker: Cleanup process tree
        else Non-ENVIRONMENT
            Pulse->>Pulse: Original failure logic
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🏥 A CLI breathes before it runs,
Environment gremlins meet their match—
No wasteful retry, cached and spun,
Just defer with grace and a gentle patch.
Infrastructure speaks, we now listen true. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: adding a worker_never_started diagnostic mechanism and implementing environment-aware retry logic with CLI health checks across dispatch, evaluate, and pulse scripts.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 22:48:06 UTC 2026: Code review monitoring started
Thu Feb 19 22:48:07 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 22:48:09 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud

coderabbitai[bot]
coderabbitai bot previously requested changes Feb 19, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.agents/scripts/supervisor/dispatch.sh (2)

1530-1534: Document the external unset contract for _PULSE_CLI_VERIFIED.

_PULSE_CLI_VERIFIED is never cleared within this function or file — its lifecycle depends on pulse.sh unsetting it on ENVIRONMENT failures. The function comment only describes the fast-path behaviour, not the obligation on the caller. A future refactor or new entrypoint that bypasses pulse.sh would silently skip all CLI health checks for the entire process lifetime.

Consider adding a brief note (matching the existing _PULSE_HEALTH_VERIFIED pattern) so the coupling is explicit:

✏️ Suggested doc addition

```diff
 	# Pulse-level fast path: if CLI was already verified in this pulse, skip
+	# NOTE: _PULSE_CLI_VERIFIED must be unset by pulse.sh after ENVIRONMENT failures
+	# so the next pulse re-probes the CLI. See pulse.sh ENVIRONMENT failure handler.
 	if [[ -n "${_PULSE_CLI_VERIFIED:-}" ]]; then
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/dispatch.sh around lines 1530 - 1534, The
fast-path uses the external flag _PULSE_CLI_VERIFIED but its lifecycle is
controlled elsewhere (pulse.sh) and is never unset in this file; add a short
comment above the check for _PULSE_CLI_VERIFIED in dispatch.sh (matching the
existing _PULSE_HEALTH_VERIFIED pattern) documenting that callers (pulse.sh)
must unset this variable on ENVIRONMENT failures and that other entrypoints must
honor/replicate that contract to avoid silently skipping CLI health checks for
the process lifetime; reference the variable name _PULSE_CLI_VERIFIED and
pulse.sh in the comment so future maintainers know where the unset happens and
the obligation to clear it.

978-984: check_cli_health is not called in do_prompt_repeat or cmd_reprompt, leaving those dispatch paths unguarded.

The PR's stated goal is to avoid wasting retries when the CLI environment is broken. Both do_prompt_repeat (line 980) and cmd_reprompt (line 3001) call check_model_health only, without verifying CLI binary health. Since both can spawn real workers and consume retry counts, an environment where the opencode binary is missing or broken would be caught on the first dispatch via cmd_dispatch, but subsequent prompt-repeat and reprompt cycles would still proceed and fail.

Consider adding the same guard to both functions:

♻️ Proposed addition to do_prompt_repeat (~Line 978)

```diff
 	# Pre-dispatch health check
 	local health_model health_exit=0
+	local cli_health_rc=0
+	check_cli_health "$ai_cli" >/dev/null || cli_health_rc=$?
+	if [[ "$cli_health_rc" -ne 0 ]]; then
+		log_error "do_prompt_repeat: CLI health check failed ($ai_cli) — deferring"
+		return 1
+	fi
 	health_model=$(resolve_model "health" "$ai_cli")
```

Apply the same pattern to cmd_reprompt near line 3001. The redirect to /dev/null discards diagnostic stdout while preserving stderr emission.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/dispatch.sh around lines 978 - 984,
do_prompt_repeat and cmd_reprompt currently only call check_model_health,
leaving CLI binary failures unguarded; add the same check_cli_health guard used
in cmd_dispatch: after resolving the ai_cli and health_model (using
resolve_model "$ai_cli"), call check_cli_health "$ai_cli" "$health_model" and
capture its exit code (e.g., cli_exit=0; check_cli_health ... || cli_exit=$?);
if cli_exit is non‑zero log a warning like the existing log_warn message and
return 1 to abort the prompt-repeat/reprompt path so retries aren't wasted.

Comment on lines +1562 to +1595
```shell
	local version_output=""
	local version_exit=1

	# Use timeout to prevent hanging on broken installations
	local timeout_cmd=""
	if command -v gtimeout &>/dev/null; then
		timeout_cmd="gtimeout"
	elif command -v timeout &>/dev/null; then
		timeout_cmd="timeout"
	fi

	if [[ "$ai_cli" == "opencode" ]]; then
		if [[ -n "$timeout_cmd" ]]; then
			version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1) || version_exit=$?
		else
			version_output=$("$ai_cli" version 2>&1) || version_exit=$?
		fi
	else
		# claude CLI
		if [[ -n "$timeout_cmd" ]]; then
			version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1) || version_exit=$?
		else
			version_output=$("$ai_cli" --version 2>&1) || version_exit=$?
		fi
	fi

	# If version command succeeded (exit 0) or produced output, CLI is working
	if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then
		# Cache the healthy result
		date +%s >"$cli_cache_file" 2>/dev/null || true
		_PULSE_CLI_VERIFIED="true"
		log_info "CLI health: OK ($ai_cli: ${version_output:0:80})"
		return 0
	fi
```

⚠️ Potential issue | 🟡 Minor

version_exit=1 initialization makes the exit_code==0 branch unreachable and enables a false-positive health result.

The || version_exit=$? idiom only assigns when the command exits non-zero (short-circuit evaluation). Because version_exit is initialized to 1, if opencode version / claude --version exits successfully (code 0), version_exit stays at 1 — the "$version_exit" -eq 0 guard on Line 1589 is permanently dead code.

More importantly this creates a false positive: a binary that exits non-zero (broken installation, startup crash) but still emits any output to stderr/stdout (which 2>&1 captures) will satisfy the second condition [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]] and be reported as healthy — exactly the class of environment failure this check is designed to catch.

🐛 Proposed fix — separate assignment from capture

```diff
-	local version_output=""
-	local version_exit=1
+	local version_output=""
+	local version_exit=0

 	if [[ "$ai_cli" == "opencode" ]]; then
 		if [[ -n "$timeout_cmd" ]]; then
-			version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1) || version_exit=$?
+			version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1)
+			version_exit=$?
 		else
-			version_output=$("$ai_cli" version 2>&1) || version_exit=$?
+			version_output=$("$ai_cli" version 2>&1)
+			version_exit=$?
 		fi
 	else
 		if [[ -n "$timeout_cmd" ]]; then
-			version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1) || version_exit=$?
+			version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1)
+			version_exit=$?
 		else
-			version_output=$("$ai_cli" --version 2>&1) || version_exit=$?
+			version_output=$("$ai_cli" --version 2>&1)
+			version_exit=$?
 		fi
 	fi
```

With this change, a clean exit 0 (even with empty output) is treated as healthy via the first condition, and a non-zero exit is correctly treated as a failure unless there is output (which preserves the "output-even-on-non-zero" lenient path intentionally).
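The short-circuit pitfall this fix addresses is easy to reproduce in isolation:

```shell
# `|| rc=$?` only runs when the command FAILS, so a pre-seeded rc survives success:
rc=1
out=$(true) || rc=$?
echo "$rc"   # prints 1, even though `true` exited 0

# Capturing $? on the following line records the real status either way:
out=$(true)
rc=$?
echo "$rc"   # prints 0
```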

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/dispatch.sh around lines 1562 - 1595, The
health-check incorrectly initializes version_exit=1 and uses the "cmd ||
version_exit=$?" idiom, causing successful commands to leave version_exit at 1
and producing false positives; change each invocation that currently does
version_output=$(... ) || version_exit=$? to first capture output then
immediately set version_exit=$? (e.g., version_output=$(... 2>&1);
version_exit=$?) for all branches that run the CLI (the blocks referencing
timeout_cmd, ai_cli, and the "opencode" vs "claude CLI" branches), leaving the
rest of the logic (the check using version_exit, version_output, cli_cache_file,
_PULSE_CLI_VERIFIED, and log_info) unchanged.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a valuable pre-dispatch health check for the AI CLI, improving the overall reliability of the supervisor by preventing wasted retries on environment-related issues. The classification of failures into a new ENVIRONMENT category is a sound architectural decision. However, there are some technical issues in the health check implementation, particularly regarding arithmetic safety with potentially empty variables and incorrect exit code handling due to initialization values.

Comment on lines +1541 to +1545
```shell
	local cached_at
	cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0")
	local now
	now=$(date +%s)
	local age=$((now - cached_at))
```


Severity: high

The arithmetic expansion at line 1545 is unsafe because cached_at can be empty if the cache file exists but is empty (in which case cat succeeds and the || echo "0" fallback is not triggered). This would cause a syntax error in the arithmetic expansion. Additionally, following the repository style guide (line 11), declarations and assignments for expansions that can fail should be separate to ensure exit code safety.

Suggested change

```diff
 local cached_at
 cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0")
 local now
 now=$(date +%s)
-local age=$((now - cached_at))
+local age
+age=$((now - ${cached_at:-0}))
```
References
  1. Use local var="$1" pattern in functions (declare and assign separately for exit code safety) (link)
  2. In shell scripts with set -e, var=$(cat file || echo 0) is not safe for arithmetic if the file can be empty. cat on an empty file succeeds, making var an empty string and causing arithmetic expansion to fail. Use parameter expansion with a default value (e.g., $((... - ${var:-0}))) to handle empty strings robustly.
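One nuance worth noting: bash only errors here for non-numeric cache contents. A variable referenced by name inside $((…)) that is empty or unset evaluates to 0, so the ${cached_at:-0} guard is cheap insurance rather than a strict requirement for the empty-file case. A quick demonstration:

```shell
# In bash arithmetic, an empty variable referenced by NAME evaluates to 0,
# so an empty cache file alone does not break $((...)):
cached_at=""
echo $(( 100 - cached_at ))         # prints 100

# Non-numeric cache contents, however, do fail the expansion:
cached_at="not a number"
if ! ( : $(( 100 - cached_at )) ) 2>/dev/null; then
	echo "arithmetic error"
fi

# The default guard handles the empty case explicitly:
cached_at=""
echo $(( 100 - ${cached_at:-0} ))   # prints 100
```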


```shell
	# Check 2: binary is executable and can produce version output
	local version_output=""
	local version_exit=1
```


Severity: high

Initializing version_exit to 1 causes a logic error in the health check. When the version command succeeds (exit 0), the || version_exit=$? part is skipped, leaving version_exit at its initial value of 1. This causes the first condition of the health check at line 1589 ([[ "$version_exit" -eq 0 ]]) to fail even on success, forcing the check to rely on the more lenient output-based condition. It should be initialized to 0.

Suggested change

```diff
-local version_exit=1
+local version_exit=0
```

```shell
	fi

	# If version command succeeded (exit 0) or produced output, CLI is working
	if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then
```


Severity: medium

The health check logic is currently too lenient. If the version command fails (e.g., exit 1) but still produces some output (like an error message), it will be considered 'healthy' because of the [[ -n "$version_output" ... ]] condition. Given that the goal is to detect broken installations, a non-zero exit code should generally be treated as a failure unless there's a specific reason to trust the output of a failed command.

Suggested change

```diff
-if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then
+if [[ "$version_exit" -eq 0 ]]; then
```
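For context on the two excluded exit codes in the original condition (assuming GNU coreutils semantics): timeout(1) exits 124 when the command times out under the default TERM signal, and 137 (128 + SIGKILL) when it has to kill the command with -s KILL. The lenient branch therefore still treats hung-then-killed probes as failures:

```shell
# GNU timeout exit-code conventions (assumed environment: coreutils timeout)
timeout 1 sleep 5
echo $?   # prints 124 (command timed out, default TERM)

timeout -s KILL 1 sleep 5
echo $?   # prints 137 (128 + 9, command had to be SIGKILLed)
```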
