Skip to content

fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag#1551

Open
ztech-gthb wants to merge 4 commits intocoleam00:devfrom
ztech-gthb:fix/workflow-resume-prompts-user
Open

fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag#1551
ztech-gthb wants to merge 4 commits intocoleam00:devfrom
ztech-gthb:fix/workflow-resume-prompts-user

Conversation

@ztech-gthb
Copy link
Copy Markdown
Contributor

@ztech-gthb ztech-gthb commented May 3, 2026

Summary

  • Problem: /workflow run X "task B" silently auto-resumes a prior failed run of X in the same chat, executing in the failed run's sub-worktree with the failed run's persisted user_message (= "task A"). The new prompt is discarded with no UI/log indication.
  • Why it matters: confident-but-wrong reports of work the user didn't ask for. Trust-corroding silent data loss; users debug nonexistent problems while the assistant runs the same stale task repeatedly.
  • What changed: failed runs now emit a 3-option prompt (resume / abandon-then-rerun / --force) instead of silently auto-resuming. /workflow abandon accepts failed runs (transitions them to cancelled). New --force flag bypasses the resume-detection lookup for callers who explicitly want a fresh run while keeping the failed audit row.
  • What did not change: paused-run auto-resume (PR 🐛 UserReportedError: Manual bug report #914 approval-gate path) is unchanged; /workflow resume <id> still accepts failed runs; DB schema unchanged; no migrations.

UX Journey

Before

User                          Archon orchestrator           DB
────                          ───────────────────           ──
runs /workflow run X "A" ───▶ findResumable... → null
                              dispatch fresh run        ───▶ run-A row (failed)

runs /workflow run X "B" ───▶ findResumable... → run-A
                              [silent foreground resume]
                              executeWorkflow(working_path=
                                  thread-of-run-A,
                                  userMessage="B"
                                  ↑ ignored: scripts read
                                  $ARTIFACTS_DIR/.X persisted
                                  from run-A)             ───▶ "completed" on
                                                                task A again

sees positive report ◀──── reports task A as success
   (still confused about
    why task B didn't happen)

After

User                          Archon orchestrator           DB
────                          ───────────────────           ──
runs /workflow run X "A" ───▶ findResumable... → null
                              dispatch fresh run        ───▶ run-A row (failed)

runs /workflow run X "B" ───▶ findResumable... → run-A
                              status === 'paused'?
                                no → emit 3-option prompt    [no new run]
                                     **return** (no dispatch)
sees prompt with: ◀────
  /workflow resume <id>           [option 1]
  /workflow abandon <id>          [option 2: works on failed now]
   then re-run command
  /workflow run X --force "B"     [option 3: keep run-A as-is]

Architecture Diagram

Before

                   ┌─ command-handler.ts (parses /workflow run)
                   │     │ args.slice(2).join(' ') = "..."
                   │     ▼
                   │  CommandResult.workflow = { definition, args }
                   │     │
orchestrator-      │     ▼
agent.ts ──────────┴─ handleWorkflowRunCommand
                         │
                         ▼
                      dispatchOrchestratorWorkflow
                         │
                         ▼
                      findResumableRunByParentConversation
                      (status IN ['failed','paused'])
                         │
                         ▼
                      resumableRun? → executeWorkflow [silent]
                                     (auto-resume regardless of status)


workflow-operations.ts: abandonWorkflow rejects ALL terminal statuses
  (= completed | failed | cancelled) — failed runs un-abandonable

After

                   ┌─ command-handler.ts (parses /workflow run)
                   │     │ NEW: also parses --force from args
                   │     ▼
                   │  CommandResult.workflow = { definition, args, force? } [~]
                   │     │
orchestrator-      │     ▼
agent.ts ──────────┴─ handleWorkflowRunCommand(..., options={force}) [~]
                         │
                         ▼
                      dispatchOrchestratorWorkflow(..., options) [~]
                         │
                         ▼
                      options.force? null : findResumableRun(...) [+]
                         │
                         ▼
                      resumableRun?
                         ├─ status==='paused' → executeWorkflow [unchanged]
                         └─ else (failed/...)
                              → platform.sendMessage(3-option prompt) [+]
                              → return [+]


workflow-operations.ts: abandonWorkflow rejects only completed | cancelled [~]
  (failed → cancelled is the user's discard action; row stays in DB)

Connection inventory:

From To Status Notes
command-handler.ts:case 'run' args.findIndex('--force') new flag parsing
command-handler.ts CommandResult.workflow.force new type-level pass-through
orchestrator-agent.ts:handleWorkflowRunCommand dispatchOrchestratorWorkflow modified new options param
dispatchOrchestratorWorkflow findResumableRunByParentConversation modified guarded by options.force
dispatchOrchestratorWorkflow platform.sendMessage (3-option prompt) new failed-run user prompt
dispatchOrchestratorWorkflow executeWorkflow (paused branch) unchanged PR #914 path preserved
abandonWorkflow TERMINAL_WORKFLOW_STATUSES check removed replaced with explicit completed/cancelled

Label Snapshot

  • Risk: risk: low
  • Size: size: M
  • Scope: core
  • Module: core:orchestrator, core:operations, core:db

Change Metadata

  • Change type: bug
  • Primary scope: core

Linked Issue

Validation Evidence (required)

bun run type-check    # clean across all 10 packages
bun --filter @archon/core test
# 102 pass / 0 fail
# - foreground_resume_detected: existing test, fixture switched 'failed' → 'paused'
# - failed_resume_user_prompted: new (asserts no executeWorkflow, prompt text contains all 3 commands)
# - --force flag: new (asserts findResumable not consulted, dispatchBackground fires)

bun run lint          # clean for all touched files
bun run format:check  # clean

End-to-end manual verification on a real chat thread (2026-05-03):

  1. Existing failed run 32c786ef… from prior session stayed failed in DB.
  2. /workflow run ztech-marimo-edit "..." produced the 3-option prompt. No silent dispatch, no executeWorkflow call observed in workflow logs.
  3. /workflow run ztech-marimo-edit --force "..." dispatched fresh, completed success: true with new feature branch (wf/marimo-edit-1777809446). Original 32c786ef… row untouched.
  4. /workflow abandon 32c786ef… transitioned the row failed → cancelled, completed_at populated. Subsequent /workflow run in the same chat no longer triggered the prompt.

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

The prompt text is constructed via plain string concatenation; the userMessage's embedded quotes are escaped (\") when interpolated into the suggested re-run command block. No new untrusted-input parsing path.

Compatibility / Migration

  • Backward compatible? Yes for the paused-run auto-resume path (PR 🐛 UserReportedError: Manual bug report #914), /workflow resume <id> for failed runs, /workflow run callers without --force. The only behavior change is: failed-run-on-fresh-/workflow run now prompts instead of silently auto-resuming — the prior behavior was silent data loss, so this is the intended fix, not a regression.
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

Verified scenarios:

  • Failed-run prompt fires for status='failed' resumable, three commands appear with correct IDs interpolated.
  • --force skips the resume lookup entirely (verified via mockFindResumableRunByParentConversation.not.toHaveBeenCalled() in test, plus end-to-end real run).
  • /workflow abandon on a failed run transitions to cancelled (DB confirmed).
  • After abandon, /workflow run in the same conversation no longer prompts (cancelled is excluded from the resume lookup).
  • paused-status auto-resume continues to work in the existing test fixture.

Edge cases checked:

  • --force token recognized at any position in args (not just immediately after workflow name).
  • userMessage with embedded " is escaped in the option-3 suggested command block.

What was not verified:

  • A workflow that legitimately wants to retry a transient failed run silently (if such a case exists, this PR makes it require either /workflow resume <id> or --force). No such workflow has been identified in the default workflows.

Side Effects / Blast Radius (required)

  • Affected subsystems: core:orchestrator dispatch path, core:operations abandon validation, core:handlers workflow command parsing, types.
  • Potential unintended effects: a workflow author who depended on the silent failed-run auto-resume as a "retry on flaky steps" mechanism will see the new prompt. The prompt's first option (/workflow resume <id>) preserves that capability with one extra step.
  • Guardrails: the new orchestrator.failed_resume_user_prompted log event lets operators detect users hitting this path. The existing orchestrator.foreground_resume_detected log fires only on the (unchanged) paused branch.

Rollback Plan (required)

  • Fast rollback: revert this commit. Behavior returns to silent auto-resume of failed runs and abandon-rejection. No data migration needed.
  • Feature flags: none. The behavior change is unconditional but additive (--force is opt-in for the override case, and the prompt itself emits one user-visible message).
  • Observable failure symptoms if rollback is needed: users complaining the prompt is too noisy (would suggest making the prompt suppressible per-workflow), or workflows expecting silent retry-on-failure breaking (would suggest a per-workflow auto_resume_failed flag). Neither has been observed in testing.

Risks and Mitigations

  • Risk: a workflow author relied on silent auto-resume of failed runs as a retry mechanism.
    • Mitigation: the prompt's first option offers /workflow resume <id> as a copy-pasteable command; the retry capability is preserved with one extra user step.
  • Risk: --force is recognized as a token anywhere in args. A user passing --force as part of literal description text ("... change the --force flag handling ...") could accidentally trigger force.
    • Mitigation: known pattern in CLI tools without -- separator support. If literal collision becomes a real problem, a -- separator could be added in a follow-up. Not a regression: the flag didn't exist before, no existing input pattern is broken.
  • Risk: hand-maintained 3-option prompt drifts from the actual command-handler behavior over time.
    • Mitigation: the new test (failed_resume_user_prompted) asserts the three command strings are present in the prompt; CI catches accidental removal. A future PR could derive the prompt from command-handler metadata, but is premature given the small surface area.

Summary by CodeRabbit

  • New Features

    • Global --force option to start a fresh workflow run, skipping resume detection.
    • /workflow resume <id> returns workflow details to trigger foreground resumption when appropriate.
  • Bug Fixes

    • Paused runs auto-resume; failed runs prompt users with clear choices (resume, abandon+retry, or start fresh).
    • Abandon now accepts running, paused, and failed runs.
    • Clear error shown when a resumed workflow definition is missing.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fecbb006-f93a-4a57-b3cc-9ca39467b820

📥 Commits

Reviewing files that changed from the base of the PR and between f1168a0 and 332426e.

📒 Files selected for processing (2)
  • packages/core/src/handlers/command-handler.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/core/src/handlers/command-handler.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts

📝 Walkthrough

Walkthrough

This PR prevents silent auto-resume of failed runs: /workflow run can be forced with --force to skip resume detection; only paused runs auto-resume; failed runs prompt the user (or can be explicitly resumed via resumeRunId); abandonWorkflow now allows abandoning failed runs. Command parsing, types, orchestration, and tests updated.

Changes

Auto-resume hijack fix (single cohort)

Layer / File(s) Summary
Data Shape
packages/core/src/types/index.ts
CommandResult.workflow adds optional force?: boolean and resumeRunId?: string.
Command Parsing / CLI
packages/core/src/handlers/command-handler.ts
/workflow run extracts --force positionally (first bare arg), removes it from workflow args, and returns workflow.force; /workflow resume <id> reloads discovered workflows from the resumed run's cwd, validates the workflow still exists, and returns a workflow payload with definition, args (from run.user_message ?? '') and resumeRunId (or explicit error on discovery/missing definition).
Core Orchestration Logic
packages/core/src/orchestrator/orchestrator-agent.ts
dispatchOrchestratorWorkflow(options?: {force?: boolean; resumeRunId?: string}) added. On web path, skips resumable-run lookup when options.force is true; otherwise may find a prior run and auto-resume only for paused runs or when resumableRun.id === options.resumeRunId. For other prior runs (e.g., failed) it stops and prompts the user with resume/abandon/force choices. Natural-language approval resume calls dispatchOrchestratorWorkflow(..., { resumeRunId }). Deterministic /workflow handling forwards force and resumeRunId into handleWorkflowRunCommand.
Operations Layer
packages/core/src/operations/workflow-operations.ts
abandonWorkflow() now rejects only completed and cancelled statuses; it allows abandoning failed, running, and paused runs. Removed TERMINAL_WORKFLOW_STATUSES usage.
DB Docs
packages/core/src/db/workflows.ts
Expanded doc comment for findResumableRunByParentConversation to document orchestrator semantics: paused runs auto-resume; failed runs cause a prompt to avoid reusing stale persisted user_message.
Tests
packages/core/src/orchestrator/orchestrator-agent.test.ts, packages/core/src/handlers/command-handler.test.ts
Reset mockFindResumableRunByParentConversation in beforeEach. Updated foreground-resume test to use status: 'paused'. Added tests covering failed-run prompt, explicit resume via workflow.resumeRunId, and --force bypass; updated /workflow resume tests to assert returned workflow.resumeRunId and discovery behavior.

Sequence Diagram

sequenceDiagram
    actor User
    participant Handler as Command Handler
    participant Orchestrator as Orchestrator Agent
    participant DB as Database
    participant Executor as Workflow Executor

    rect rgba(0,150,0,0.5)
    Note over User,Executor: Forced fresh run (skip resume lookup)
    User->>Handler: /workflow run my-workflow "task B" --force
    Handler->>Handler: extract --force, clean args
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:true})
    Orchestrator->>Orchestrator: skip findResumableRunByParentConversation
    Orchestrator->>Executor: dispatchBackgroundWorkflow(fresh run)
    Executor-->>Orchestrator: run dispatched
    end

    rect rgba(100,150,255,0.5)
    Note over User,Executor: Paused run auto-resume
    User->>Handler: /workflow run my-workflow "continue"
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
    Orchestrator->>DB: findResumableRunByParentConversation()
    DB-->>Orchestrator: {status:'paused', working_path:'...'}
    Orchestrator->>Executor: executeWorkflow(working_path from prior run)
    Executor-->>Orchestrator: run resumed
    end

    rect rgba(255,100,100,0.5)
    Note over User,Orchestrator: Failed run → user prompt
    User->>Handler: /workflow run my-workflow "task B"
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
    Orchestrator->>DB: findResumableRunByParentConversation()
    DB-->>Orchestrator: {status:'failed', id:'abc123'}
    Orchestrator->>User: prompt (resume abc123 / abandon+rerun / start fresh --force)
    User-->>Handler: selects action
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • coleam00/Archon#1329: Overlapping changes to orchestrator resume logic and tests.
  • #1065: Related — changes to handleMessage/handleWorkflowRunCommand flows that now propagate force/resumeRunId.
  • #? (omitted)

Poem

🐰 I hopped through worktrees, found a paused old trail,

Failed runs now knock before they take the sail.
--force clears the path, fresh starts led by light,
Abandon hears the failed, resumes when named just right. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title accurately summarizes the three main changes: failed-run prompt, abandoning failed runs, and --force flag addition.
Description check ✅ Passed Description follows template structure with all required sections: Summary, UX Journey, Architecture Diagram, Labels, Validation, Security, Compatibility, Verification, and Risks.
Linked Issues check ✅ Passed All coding requirements from #1549 are met: failed runs prompt instead of auto-resume, --force flag skips lookup, /workflow abandon accepts failed, paused auto-resume preserved.
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objectives: orchestrator dispatch logic, workflow-operations abandon validation, command-handler parsing, and type signatures for workflow control options.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/core/src/operations/workflow-operations.ts (1)

112-127: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return the post-cancel state from abandonWorkflow.

This still returns the pre-update row, so callers that read run.status will continue to see failed/paused after a successful abandon. Now that failed runs are intentionally abandonable, that stale status is much easier to surface in UI/metadata.

Suggested fix
 export async function abandonWorkflow(runId: string): Promise<WorkflowRun> {
   const run = await getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed');
   if (run.status === 'completed' || run.status === 'cancelled') {
     throw new Error(`Cannot abandon run with status '${run.status}'. Run is already terminal.`);
   }
   try {
     await workflowDb.cancelWorkflowRun(runId);
   } catch (error) {
     const err = error as Error;
     getLog().error(
       { err, errorType: err.constructor.name, runId },
       'operations.workflow_abandon_failed'
     );
     throw new Error(`Failed to abandon workflow run ${runId}: ${err.message}`);
   }
-  return run;
+  return (await workflowDb.getWorkflowRun(runId)) ?? { ...run, status: 'cancelled' };
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/operations/workflow-operations.ts` around lines 112 - 127,
abandonWorkflow currently returns the pre-cancel WorkflowRun, so update it to
return the post-cancel state: after awaiting workflowDb.cancelWorkflowRun(runId)
call get the updated run (e.g. via getRunOrThrow(runId,
'operations.workflow_abandon_lookup_failed') or the appropriate workflowDb fetch
method) and return that WorkflowRun instead of the original `run`; keep the
existing error logging around workflowDb.cancelWorkflowRun and rethrow as
before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 281-348: The code is prompting users when a prior run.status ===
'failed' even for explicit resume attempts because resumeWorkflow() doesn't mark
the intent; update the orchestration logic to detect an explicit resume request
(e.g., accept a resumeRunId or resume flag passed into the entry point that
calls orchestrator-agent) and treat resumableRun as allowed to auto-resume when
resumableRun.id === resumeRunId (or when a resume=true flag is present), so the
branch that currently shows prompt will instead call executeWorkflow(...) for
that specific resumableRun; wire the resumeRunId from the /workflow resume <id>
handler (or make that handler dispatch directly to executeWorkflow) so
resumeWorkflow() sets that identifier before re-entering the resumable-run
check.

---

Outside diff comments:
In `@packages/core/src/operations/workflow-operations.ts`:
- Around line 112-127: abandonWorkflow currently returns the pre-cancel
WorkflowRun, so update it to return the post-cancel state: after awaiting
workflowDb.cancelWorkflowRun(runId) call get the updated run (e.g. via
getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed') or the
appropriate workflowDb fetch method) and return that WorkflowRun instead of the
original `run`; keep the existing error logging around
workflowDb.cancelWorkflowRun and rethrow as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 13320894-c3d7-4717-a5cc-d4a2e25a008f

📥 Commits

Reviewing files that changed from the base of the PR and between 69b2c89 and 599209c.

📒 Files selected for processing (6)
  • packages/core/src/db/workflows.ts
  • packages/core/src/handlers/command-handler.ts
  • packages/core/src/operations/workflow-operations.ts
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts
  • packages/core/src/types/index.ts

Comment thread packages/core/src/orchestrator/orchestrator-agent.ts
…V2a prompt for explicit resume (resumeRunId option)
@ztech-gthb
Copy link
Copy Markdown
Contributor Author

Good catch — fixed in latest commit. Added options.resumeRunId to dispatchOrchestratorWorkflow; the V2a check is now status === 'paused' || resumableRun.id === options?.resumeRunId. Two callers wire it: /workflow resume <id> now dispatches directly (no more 'run same workflow again' instruction that was looping users through the prompt) and the natural-language approval handler passes the paused run's id when it transitions status to 'failed' before dispatching. New regression test: resumeRunId option: failed run DOES auto-resume when CommandResult.workflow.resumeRunId matches. 103/103 in orchestrator-agent.test.ts, type-check + lint + format clean.

@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented May 4, 2026

@ztech-gthb related to #1549 — workflow resume prompt fix.

@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented May 4, 2026

Thanks @ztech-gthb — the architectural primitives here are right. The three-option failure prompt captures the actual decision space, and the force / resumeRunId distinction in the orchestrator avoids the prompt loop cleanly. The doc comment on abandonWorkflow is exemplary.

Three items worth tightening before merge:

1. --force parsing matches the flag anywhere in args (packages/core/src/handlers/command-handler.ts:817-828)

const forceIndex = restArgs.findIndex(a => a === '--force');

findIndex matches --force anywhere in the user's args. If someone runs:

/workflow run deploy "deploy --force to staging"

…the parser strips --force from inside the quoted message, sets force=true, and passes deploy to staging as args. Silent corruption of the user's input.

Suggested fix: only strip --force when it's a bareword in a flag-position (e.g., immediately after the workflow name, or at the end before the quoted message), not as a substring of the quoted args. Or use -- as a verbatim-args separator. The existing "be lenient" comment hints at the intent, but lenience here is unsafe.

2. Markdown escaping in the failure prompt is incomplete (packages/core/src/orchestrator/orchestrator-agent.ts)

const escapedMsg = userMessage.replace(/"/g, '\\"');

Only " is escaped. If the user's original message contains a backtick, the markdown code-blocks in the prompt close early and the rest renders as plain text. If it contains \, that won't be re-escaped either. Not a security issue, but a UX paper-cut whenever the user's first attempt had backticks in it.

Suggested fix: also escape backticks and backslashes, or render the failed-message as an indented code block (4-space) so backticks inside don't terminate it.

3. /workflow resume <id> mutates DB before checking the YAML exists (packages/core/src/handlers/command-handler.ts)

Order today:

  1. resumeWorkflow(runId) — transitions DB row to running
  2. discoverWorkflowsWithConfig(...) — load YAML
  3. If YAML missing, error

If step 3 fails, the DB is in running state but no execution happens. The error message ("Restore the YAML and try again") is fine UX-wise, but the run sits in a half-resumed state until the user retries.

Suggested fix: discover the workflow YAML before calling resumeWorkflow, or roll the DB back when discovery fails.


CI is green and the rest of the change looks solid. Happy to merge once these three are addressed (or to merge as-is and track them as a follow-up issue if you'd rather move on — your call).

…ks/backslashes in failed-run prompt (review feedback)
@ztech-gthb
Copy link
Copy Markdown
Contributor Author

Thanks @Wirasm — addressed in 332426e:

1. --force flag-position parsing (packages/core/src/handlers/command-handler.ts)
Tightened to restArgs[0] === '--force'. Worth noting: parseCommand already splits quoted strings into a single arg before the handler sees them, so the /workflow run deploy "deploy --force to staging" example resolves to args = ['run', 'deploy', 'deploy --force to staging']findIndex would have returned -1 there. But the principle is right: lenient matching invites surprises if parseCommand is ever changed. Strict position-0 match is the safer contract.

2. Markdown escaping (packages/core/src/orchestrator/orchestrator-agent.ts)
Extended to [\\\\\"\`] so backticks and backslashes are also escaped. The triple-backtick code-fence in the failed-run prompt is now safe against arbitrary userMessage content.

3. /workflow resume <id> — DB mutation before YAML check
I read this twice and I believe it's a non-issue. resumeWorkflow in packages/core/src/operations/workflow-operations.ts is a pure validator:

```ts
export async function resumeWorkflow(runId: string): Promise {
const run = await getRunOrThrow(runId, 'operations.workflow_resume_lookup_failed');
if (!RESUMABLE_WORKFLOW_STATUSES.includes(run.status)) {
throw new Error(...);
}
return run;
}
```

No status transition, no `runs` row update, no event insert. The actual resume happens later in `dispatchOrchestratorWorkflow` once the workflow YAML has been resolved and validated. The function name is admittedly misleading; happy to rename to `validateResumable` (or similar) in a follow-up if you'd like — but no half-resumed-state risk in current code. Let me know if I'm missing a side-effect somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests

2 participants