
fix(task): prevent hang when nested tasks fail #6430

Merged: jdx merged 1 commit into main from fix/task-hang-on-failure on Sep 26, 2025

Conversation

@jdx (Owner) commented Sep 26, 2025

Summary

Fixes task execution hanging when nested or dependent tasks fail.

Problem

When a task failed within a sequence of tasks (especially nested or dependent tasks), mise would hang indefinitely instead of properly exiting with an error. This was reported in #6391.

Root Cause

The inject_and_wait function was waiting for a completion signal from the dependency graph that would never come. When a task failed:

  1. The scheduler would stop spawning new dependent tasks
  2. The dependency graph would never become empty (dependent tasks remained in the graph)
  3. The done signal was never sent because the graph wasn't empty
  4. The function would hang forever waiting for the signal

Solution

Modified the inject_and_wait function to:

  1. Periodically check if task execution should stop due to failure (every 100ms)
  2. When a failure is detected and continue_on_error is false, clean up the dependency graph by removing all remaining tasks
  3. This triggers the completion signal and allows the function to exit properly with an error

Test plan

  • Added e2e test test_task_failure_hang that reproduces the issue
  • Test verifies that mise exits with error code 1 instead of hanging when a dependency fails
  • All existing task tests pass

Fixes: #6391

🤖 Generated with Claude Code

When a task failed within a sequence of tasks (especially nested or dependent tasks),
mise would hang indefinitely instead of properly exiting with an error. This was
caused by the dependency graph never becoming empty when dependent tasks were not
executed due to an earlier failure.

The fix modifies inject_and_wait() to periodically check for failures and clean up
the dependency graph when stopping early, ensuring proper termination.

Fixes: #6391

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings September 26, 2025 10:22
Copilot AI (Contributor) left a comment


Pull Request Overview

Fixes a hanging issue where mise would wait indefinitely when nested or dependent tasks fail, instead of properly exiting with an error. The fix implements periodic checking for task failures and proper cleanup of the dependency graph.

  • Replaced blocking wait with a polling loop that checks for early failure conditions every 100ms
  • Added dependency graph cleanup when failures occur and continue_on_error is false
  • Added e2e test to verify the fix prevents hanging behavior

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/cli/run.rs | Modified inject_and_wait function to use polling with timeout instead of blocking wait; added failure detection and dependency graph cleanup |
| e2e/tasks/test_task_failure_hang | Added new e2e test that verifies mise exits properly when dependent tasks fail instead of hanging |


Comment on lines +897 to +901
    let tasks_to_remove: Vec<Task> = deps.all().cloned().collect();
    for task in tasks_to_remove {
        deps.remove(&task);
    }
    drop(deps);

Copilot AI Sep 26, 2025


This cleanup code holds the lock while collecting all tasks and then iterating through them. Consider clearing the dependency graph with a single operation like deps.clear() if available, or releasing the lock between collection and removal to reduce lock contention.

Suggested change
- let tasks_to_remove: Vec<Task> = deps.all().cloned().collect();
- for task in tasks_to_remove {
-     deps.remove(&task);
- }
- drop(deps);
+ deps.clear();
+ // If deps.clear() is not available, fall back to the original method,
+ // releasing the lock between collection and removal:
+ // let tasks_to_remove: Vec<Task> = deps.all().cloned().collect();
+ // drop(deps);
+ // let mut deps = sub_deps.lock().await;
+ // for task in tasks_to_remove {
+ //     deps.remove(&task);
+ // }
+ // drop(deps);

    }
    drop(deps);
    // Give a short time for the spawned task to finish cleanly
    let _ = tokio::time::timeout(Duration::from_millis(100), done_rx).await;

Copilot AI Sep 26, 2025


The timeout duration (100ms) is hardcoded and appears twice in this function. Consider extracting this as a named constant to improve maintainability and consistency.

@jdx jdx enabled auto-merge (squash) September 26, 2025 10:23
@jdx jdx disabled auto-merge September 26, 2025 10:23
@jdx jdx enabled auto-merge (squash) September 26, 2025 10:28
@jdx jdx merged commit 5024a12 into main Sep 26, 2025
19 checks passed
@jdx jdx deleted the fix/task-hang-on-failure branch September 26, 2025 10:33
@github-actions

Hyperfine Performance

mise x -- echo

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2025.9.19 x -- echo | 20.6 ± 0.3 | 19.7 | 22.9 | 1.00 |
| mise x -- echo | 20.9 ± 0.3 | 20.1 | 22.1 | 1.02 ± 0.02 |

mise env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2025.9.19 env | 19.8 ± 0.5 | 19.2 | 26.6 | 1.00 |
| mise env | 20.3 ± 0.6 | 19.4 | 26.9 | 1.02 ± 0.04 |

mise hook-env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2025.9.19 hook-env | 19.5 ± 0.2 | 18.9 | 20.2 | 1.00 |
| mise hook-env | 20.0 ± 0.3 | 19.1 | 21.2 | 1.02 ± 0.02 |

mise ls

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2025.9.19 ls | 17.4 ± 0.3 | 16.8 | 18.4 | 1.00 |
| mise ls | 17.8 ± 0.3 | 17.1 | 18.9 | 1.02 ± 0.02 |

xtasks/test/perf

| Command | mise-2025.9.19 | mise | Variance |
|---|---|---|---|
| install (cached) | 171ms | 105ms ✅ | +62% |
| ls (cached) | 65ms | 65ms | +0% |
| bin-paths (cached) | 71ms | 71ms | +0% |
| task-ls (cached) | 504ms | 500ms | +0% |

✅ Performance improvement: cached install is 62% faster

@jdx jdx mentioned this pull request Sep 26, 2025
jdx added a commit to stempler/mise that referenced this pull request Sep 30, 2025
Extends the fix from jdx#6430 to handle the main scheduler loop. When a
task failure occurs and continue_on_error is false, the scheduler now
properly cleans up the main dependency graph before exiting.

## Problem
The previous fix in jdx#6430 addressed hangs in nested task sequences
(inject_and_wait), but didn't handle the main scheduler loop. When
multiple tasks depended on a common failing dependency, the scheduler
would stop accepting new work but wouldn't clean up the dependency
graph, causing it to wait indefinitely for a completion signal that
would never arrive.

## Root Cause
In the main scheduler loop (lines 413-492):
1. When a dependency fails, is_stopping() becomes true
2. The scheduler breaks out of the inner drain loop
3. However, remaining tasks stay in main_deps
4. The spawned async task waits for deps to be empty before signaling
5. This creates a deadlock - scheduler stopped but deps not cleared

## Solution
Added cleanup logic similar to inject_and_wait: when is_stopping() is
detected, explicitly remove all remaining tasks from main_deps. This
triggers the completion signal and allows proper shutdown.

Fixes the hang demonstrated in test_task_failure_hang_depends

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
jdx added a commit that referenced this pull request Sep 30, 2025
…6481)

## Summary
Fixes task execution hanging when a task depends on multiple tasks that
share a common failing dependency. Extends the fix from #6430 to handle
the main scheduler loop.

## Problem
When multiple tasks depend on a common dependency that fails (e.g.,
`lint`, `test`, and `build:docker` all depend on `install`, which
fails), and another task depends on all of them (e.g., `check` depends
on all three), the scheduler would hang indefinitely instead of properly
exiting with an error.

This scenario wasn't covered by the previous fix in #6430, which only
handled nested task sequences within `inject_and_wait`.

## Root Cause
In the main scheduler loop:
1. When a dependency fails, `is_stopping()` becomes true
2. The scheduler stops accepting new tasks from the queue
3. However, remaining tasks stay in the `main_deps` dependency graph
4. The spawned async task waits for `main_deps` to become empty before
sending the `main_done` signal
5. This creates a deadlock - the scheduler has stopped but the deps
graph is never cleared

## Solution
Added cleanup logic similar to `inject_and_wait`: when `is_stopping()`
is detected in the main scheduler loop, explicitly remove all remaining
tasks from `main_deps`. This triggers the completion signal and allows
proper shutdown.

## Test plan
- Added e2e test `test_task_failure_hang_depends` that reproduces the
complex dependency scenario
- Test verifies that mise exits with error code 1 instead of hanging
when dependencies fail
- Existing test `test_task_failure_hang` continues to pass
- All task-related e2e tests pass

Fixes: #6391

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: jdx <216188+jdx@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>