Skip to content

fix(automation): create workspace + launch agent in a single relay call (SUPER-783)#4936

Draft
justincrich wants to merge 1 commit into
superset-sh:mainfrom
justincrich:improvement/SUPER-783-automation-new-workspace-target
Draft

fix(automation): create workspace + launch agent in a single relay call (SUPER-783)#4936
justincrich wants to merge 1 commit into
superset-sh:mainfrom
justincrich:improvement/SUPER-783-automation-new-workspace-target

Conversation

@justincrich
Copy link
Copy Markdown
Contributor

@justincrich justincrich commented May 26, 2026

Summary

Closes SUPER-783. "New workspace" automations (the default for scheduled agent runs) failed silently — the workspace would never get created and the agent would never start. User saw an opaque dispatch failure with no indication why.

Root cause

dispatchAutomation made two sequential relay round-trips for the new-workspace path:

  1. workspaces.create → returns the new workspace id
  2. agents.run → spawns the agent against that id

Each round-trip independently races the host's tunnel state. The first call would usually succeed (workspace gets created in cloud + host), but the second call would frequently land in a moment where the tunnel had just reconnected and the relay's in-memory routing didn't yet have the host — relay returned 503 "Host not connected" and the agent never spawned. The cloud wrote the run as dispatch_failed even though a workspace had actually been created on the host.

Fix

The host's workspaces.create handler already supports launching agents inline via the agents: [] field — see packages/host-service/src/trpc/router/workspaces/workspaces.ts:1025 (dispatchSugarAgents). The Superset SDK (packages/sdk/src/resources/workspaces.ts:41) already uses this. The dispatch path was the only caller that split create and launch into two trips.

This PR passes agents: [{ agent, prompt }] in the same workspaces.create call. One relay round-trip instead of two, no second tunnel race, atomic on the host side.

The pinned-workspace path (automation.v2WorkspaceId !== null) is unchanged — still a single agents.run call.

Diff

packages/trpc/src/router/automation/dispatch.ts | 84 ++++++++++++++++++-------
1 file changed, 60 insertions(+), 24 deletions(-)

Test plan

  • Schedule an automation with New workspace target on a host with a healthy tunnel → workspace gets created AND agent session starts (one automation_runs row with status='dispatched', v2_workspace_id populated, session_kind set)
  • Run history shows the new workspace + the agent session as a normal run, not dispatch_failed
  • Pinned-workspace automation still works (regression: automation.v2WorkspaceId !== null path unchanged)
  • Host-side agent-launch errors (e.g. project not set up locally) propagate as dispatch_failed with the host's error message instead of an opaque relay error

@capy-ai
Copy link
Copy Markdown

capy-ai Bot commented May 26, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

@stage-review
Copy link
Copy Markdown

stage-review Bot commented May 26, 2026

Ready to review this PR? Stage has broken it down into 2 individual chapters for you:

Title
1 Update workspace creation helper to include agents
2 Consolidate automation dispatch into single relay call
Open in Stage

Chapters generated by Stage for commit a47df6a on May 26, 2026 10:59pm UTC.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 926ed0b6-1220-4202-ae01-438367938c11

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements relay tunnel status tracking from hosts to the cloud and gates automation dispatch on relay connectivity. Hosts report tunnel state changes, the cloud API records timestamp-ordered updates, and the dispatch API defers, fails, or proceeds with automations based on the target host's relay status.

Changes

Relay Tunnel Status Tracking and Dispatch Validation

Layer / File(s) Summary
Relay status schema and types
packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql, packages/db/drizzle/meta/_journal.json, packages/db/src/schema/schema.ts
Database migration adds relay_tunnel_status (default unknown) and relay_status_updated_at columns to v2_hosts. Schema exports RelayTunnelStatus type and relayTunnelStatusValues enum (unknown, disabled, connecting_initial, connecting_recovery, connected, disconnected).
Tunnel client state machine lifecycle
packages/host-service/src/tunnel/tunnel-client.ts
TunnelClient introduces a TunnelClientState state machine that distinguishes initial connect, recovery reconnect, successful connection, and disconnected states. It tracks recovery duration and emits disconnected once elapsed time exceeds RECOVERY_GIVEUP_MS, with deduplication and error handling around state change callbacks.
Host relay status reporting
packages/host-service/src/serve.ts, packages/host-service/src/tunnel/connect.ts
Host service reports relay status on startup (disabled when relay URL not set) and subscribes tunnel state changes via onStateChange callback. connectRelay computes the host routing key and wires the callback to publish state transitions to the cloud with best-effort error handling.
Cloud relay status mutation endpoint
packages/trpc/src/router/host/host.ts
New setRelayTunnelStatus mutation accepts hostId, status, and reportedAt, validates routing key and caller access, and conditionally updates v2Hosts.relayTunnelStatus and v2Hosts.relayStatusUpdatedAt only when reportedAt is newer than the stored timestamp (or when stored is null).
Dispatch relay validation and deferral
apps/api/src/app/api/automations/dispatch/[id]/route.ts, packages/trpc/src/router/automation/automation.ts, packages/trpc/src/router/automation/dispatch.ts
Dispatch route gates execution on target host relayTunnelStatus: disabled records a dispatch_failed run and returns relay_disabled; connecting_initial/connecting_recovery defers via QStash with an incremented attempt counter and per-attempt deduplication up to MAX_DEFER_ATTEMPTS, after which it records a timeout failure and returns deferred_timeout; other statuses proceed to normal dispatch. automationRouter.runNow also pre-checks and rejects disabled relay with BAD_REQUEST.

Sequence Diagram

sequenceDiagram
  participant Host as TunnelClient (host)
  participant CloudAPI as host.setRelayTunnelStatus
  participant DB as v2Hosts
  participant DispatchRoute
  participant QStash
  participant Dispatcher as dispatchAutomation

  Host->>CloudAPI: setRelayTunnelStatus(hostId, status, reportedAt)
  CloudAPI->>DB: conditional update relayTunnelStatus if reportedAt newer
  DispatchRoute->>DB: read relayTunnelStatus for targetHostId
  alt disabled
    DispatchRoute->>DB: insert dispatch_failed run
    DispatchRoute-->>DispatchRoute: return relay_disabled
  else connecting_*
    DispatchRoute->>QStash: publish deferred request (attempt+1, dedup)
  else
    DispatchRoute->>Dispatcher: dispatchAutomation(...)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • superset-sh/superset#4594: Overlaps on tunnel-client reconnection lifecycle adjustments; both modify reconnect behavior and state handling.
  • superset-sh/superset#4539: Also modifies tunnel-client connection lifecycle; related to connect-phase timing and cleanup logic.
  • superset-sh/superset#4734: Related changes in automations dispatch flow and resolveTargetHost / dispatchAutomation logic.

Poem

🐰 I hop and watch the tunnel hum,
tiny states that come and run—
a whisper to the cloud I send,
"Please wait, or fail, or try again."
Relay paths lead work back home.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title describes gating dispatch on relay tunnel readiness, but the raw_summary and PR description reveal the fix is more nuanced: it adds relay tunnel status tracking, conditional retry logic via QStash, and pre-checks—not a simple atomic gate. The title is partially related but oversimplifies the main change.
Description check ✅ Passed The PR description is comprehensive, covering root cause, fix strategy, and test plan. It follows the template with Summary (Related Issues via SUPER-783), Root cause explanation, and detailed Test plan sections. All critical context is present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR introduces v2_hosts.relay_tunnel_status as a first-class cloud signal to gate the automation dispatch worker on relay tunnel readiness, fixing silent failures where automations dispatched during cold boot or with relay disabled surface an opaque 502. The host-service emits state transitions (connecting_initial, connecting_recovery, connected, disconnected, disabled) via a new setRelayTunnelStatus tRPC mutation, and the dispatch route branches on those states to either proceed, defer via QStash (up to 10 × 30 s), or write an actionable error row.

  • Dispatch gate (route.ts): reads relay_tunnel_status before calling the relay for pinned-host automations and re-enqueues with a 30 s delay when the host is still connecting, or writes a clear settings-pointing error when relay is disabled.
  • TunnelClient state machine (tunnel-client.ts): tracks hasBeenOpen and recoveryStartedAt to distinguish initial vs. recovery connecting phases and escalates to disconnected after 5 min of unsuccessful reconnects.
  • runNow guard (automation.ts): pre-checks disabled status for pinned hosts and throws BAD_REQUEST before creating any run row, so the failure feels like inline validation rather than a spawned failure.

Confidence Score: 4/5

Safe to merge for the pinned-host path; the auto-routed path still has the original tunnel-readiness gap.

The tunnel-status gate works correctly when targetHostId is explicitly set. However, resolveTargetHost in dispatch.ts — used for all auto-routed automations (targetHostId = null) — selects relayTunnelStatus but never filters or branches on it, so a host with disabled or connecting_initial status can still be chosen and hit the relay, reproducing the exact opaque 502 this PR is meant to eliminate.

packages/trpc/src/router/automation/dispatch.ts — the auto-routing query needs to exclude hosts whose relay tunnel status would cause a dispatch failure.

Important Files Changed

Filename Overview
apps/api/src/app/api/automations/dispatch/[id]/route.ts Adds tunnel-status gate before relay calls for pinned-host automations; introduces QStash defer-and-retry loop (up to 10×30s). Logic is sound for the explicit-targetHostId path.
packages/trpc/src/router/automation/dispatch.ts Selects relayTunnelStatus in the auto-routing query but never acts on it — disabled/connecting hosts can still be picked for auto-routed automations, leaving the same 502 race for that path.
packages/host-service/src/tunnel/tunnel-client.ts New state machine with hasBeenOpen/recoveryStartedAt tracking. Duplicate emission suppression, recovery giveup at 5 min, and correct disconnected escalation all look solid.
packages/trpc/src/router/host/host.ts New setRelayTunnelStatus mutation with correct stale-write rejection, org membership verification, and routing-key parsing.
packages/trpc/src/router/automation/automation.ts Adds disabled-guard pre-check to runNow; correctly scoped to pinned hosts only; throws BAD_REQUEST before any run row is created.
packages/db/src/schema/schema.ts Adds relayTunnelStatus (text with enum, default 'unknown') and relayStatusUpdatedAt columns to v2_hosts; matches the SQL migration.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[QStash fires dispatch] --> B[Verify signature + parse payload]
    B --> C{automation.targetHostId set?}
    C -- No --> D[dispatchAutomation\nauto-route via resolveTargetHost]
    C -- Yes --> E[SELECT relay_tunnel_status\nfrom v2_hosts]
    E --> F{relay_tunnel_status?}
    F -- disabled --> G[INSERT dispatch_failed row\nactionable settings error]
    F -- connecting_initial\nconnecting_recovery --> H{attempt >= 10?}
    H -- Yes --> I[INSERT dispatch_failed row\ntimeout error]
    H -- No --> J[qstash.publishJSON\ndelay=30s, attempt+1]
    F -- connected --> K[dispatchAutomation]
    F -- disconnected\nunknown --> K
    K --> L{isOnline?}
    L -- No --> M[skipped_offline row]
    L -- Yes --> N[createWorkspaceOnHost\nrunAgentOnHost]
    D --> L
Loading

Comments Outside Diff (1)

  1. packages/trpc/src/router/automation/dispatch.ts, line 172-203 (link)

    P1 Auto-routing ignores relay tunnel status

    resolveTargetHost selects relayTunnelStatus for the auto-routed path but never filters on it. The WHERE clause only checks isOnline = true, so a host with relayTunnelStatus = 'disabled' or 'connecting_initial' can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in route.ts only runs when automation.targetHostId is set, so auto-routed automations (targetHostId = null) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.

    The fix should add neq(v2Hosts.relayTunnelStatus, 'disabled') to the WHERE clause at minimum, and ideally also filter out 'connecting_initial' / 'connecting_recovery' (or defer in dispatchAutomation for those states, analogous to what route.ts does for pinned hosts).

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/trpc/src/router/automation/dispatch.ts
    Line: 172-203
    
    Comment:
    **Auto-routing ignores relay tunnel status**
    
    `resolveTargetHost` selects `relayTunnelStatus` for the auto-routed path but never filters on it. The `WHERE` clause only checks `isOnline = true`, so a host with `relayTunnelStatus = 'disabled'` or `'connecting_initial'` can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in `route.ts` only runs when `automation.targetHostId` is set, so auto-routed automations (`targetHostId = null`) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.
    
    The fix should add `neq(v2Hosts.relayTunnelStatus, 'disabled')` to the `WHERE` clause at minimum, and ideally also filter out `'connecting_initial'` / `'connecting_recovery'` (or defer in `dispatchAutomation` for those states, analogous to what `route.ts` does for pinned hosts).
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/trpc/src/router/automation/dispatch.ts:172-203
**Auto-routing ignores relay tunnel status**

`resolveTargetHost` selects `relayTunnelStatus` for the auto-routed path but never filters on it. The `WHERE` clause only checks `isOnline = true`, so a host with `relayTunnelStatus = 'disabled'` or `'connecting_initial'` can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in `route.ts` only runs when `automation.targetHostId` is set, so auto-routed automations (`targetHostId = null`) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.

The fix should add `neq(v2Hosts.relayTunnelStatus, 'disabled')` to the `WHERE` clause at minimum, and ideally also filter out `'connecting_initial'` / `'connecting_recovery'` (or defer in `dispatchAutomation` for those states, analogous to what `route.ts` does for pinned hosts).

Reviews (2): Last reviewed commit: "fix(automation): gate dispatch on relay ..." | Re-trigger Greptile

Comment on lines +22 to +24
export function describeError(err: unknown, phase: DispatchPhase): string {
if (err instanceof DispatchInvariantError) return `${phase}: ${err.message}`;
if (err instanceof RelayDispatchError) return `${phase}: ${err.message}`;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant DispatchInvariantError branch in describeError

DispatchInvariantError extends Error, so the first branch is dead code relative to the third — both produce identical output. This doesn't cause a bug today, but a future contributor might add specialised handling to one branch without noticing the other, creating a silent inconsistency. Either add behaviour that actually differs (e.g. a distinct prefix or extra logging), or remove the branch and rely on the generic Error check.

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/trpc/src/router/automation/dispatch-errors.ts
Line: 22-24

Comment:
**Redundant `DispatchInvariantError` branch in `describeError`**

`DispatchInvariantError` extends `Error`, so the first branch is dead code relative to the third — both produce identical output. This doesn't cause a bug today, but a future contributor might add specialised handling to one branch without noticing the other, creating a silent inconsistency. Either add behaviour that actually differs (e.g. a distinct prefix or extra logging), or remove the branch and rely on the generic `Error` check.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

Re-trigger cubic

@justincrich justincrich marked this pull request as draft May 26, 2026 19:07
@justincrich justincrich force-pushed the improvement/SUPER-783-automation-new-workspace-target branch from 128d590 to 293f260 Compare May 26, 2026 20:49
@justincrich justincrich changed the title fix(automation): distinguish workspace-create vs agent-run dispatch errors (SUPER-783) fix(automation): gate dispatch on relay tunnel readiness (SUPER-783) May 26, 2026
@justincrich justincrich marked this pull request as ready for review May 26, 2026 20:54
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/trpc/src/router/automation/dispatch.ts (1)

172-200: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Auto-routed dispatches still bypass relay readiness gating.

resolveTargetHost() still returns any online host when automation.targetHostId is null; the newly selected relayTunnelStatus / relayStatusUpdatedAt fields are never consulted. That means auto-routed automations can still dispatch against disabled or connecting_* hosts and hit the same opaque relay failures this PR is trying to eliminate. Please either filter to relay-ready hosts here or surface the chosen host's relay status back to the route so it can apply the same disabled/defer behavior before calling the relay.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/trpc/src/router/automation/dispatch.ts` around lines 172 - 200, The
resolveTargetHost flow currently returns any online host and ignores relay
readiness; update the selection logic in resolveTargetHost (the query using
v2Hosts/v2UsersHosts and automation.targetHostId) to either 1) filter to only
relay-ready hosts by adding conditions on v2Hosts.relayTunnelStatus (exclude
'disabled' and 'connecting_*' states, accept 'ready') and optionally on
v2Hosts.relayStatusUpdatedAt (e.g., recent/valid), or 2) ensure the query still
returns relayTunnelStatus and relayStatusUpdatedAt alongside the host and
propagate those fields back to the caller so the route can apply the same
disabled/defer logic before invoking the relay; make the change in the block
that builds the select/from/where (the const [host] = await dbWs... query) or in
resolveTargetHost so the chosen host's relay readiness is enforced or surfaced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/host-service/src/tunnel/tunnel-client.ts`:
- Around line 124-126: When scheduleReconnect() decides the recovery budget has
been exhausted and emits "disconnected", set a sticky flag (e.g.,
this.recoveryGivenUp or this.stickyDisconnected) on the TunnelClient; then
update connect() where it currently calls this.emitState(this.hasBeenOpen ?
"connecting_recovery" : "connecting_initial") to first check that flag and, if
set, avoid emitting "connecting_recovery" (leave state as "disconnected" or
no-op); make the same change for the other similar block around the code
referenced at lines ~468-486 so subsequent connect attempts don't revert the
hard failure state.

In `@packages/trpc/src/router/host/host.ts`:
- Around line 245-260: The current authorization
(db.query.v2UsersHosts.findFirst) lets any linked user update tunnel status;
tighten it to require an owner-level association so only the host owner can
report tunnel state. Update the findFirst where clause (the v2UsersHosts query
used in this handler) to include a check for the owner role/flag (e.g.,
eq(v2UsersHosts.role, 'owner') or the equivalent owner boolean column) against
ctx.userId, parsed.organizationId and parsed.machineId, and keep throwing
TRPCError(FORBIDDEN) when that stricter check fails; this ensures only the host
owner can write status/reportedAt.

---

Outside diff comments:
In `@packages/trpc/src/router/automation/dispatch.ts`:
- Around line 172-200: The resolveTargetHost flow currently returns any online
host and ignores relay readiness; update the selection logic in
resolveTargetHost (the query using v2Hosts/v2UsersHosts and
automation.targetHostId) to either 1) filter to only relay-ready hosts by adding
conditions on v2Hosts.relayTunnelStatus (exclude 'disabled' and 'connecting_*'
states, accept 'ready') and optionally on v2Hosts.relayStatusUpdatedAt (e.g.,
recent/valid), or 2) ensure the query still returns relayTunnelStatus and
relayStatusUpdatedAt alongside the host and propagate those fields back to the
caller so the route can apply the same disabled/defer logic before invoking the
relay; make the change in the block that builds the select/from/where (the const
[host] = await dbWs... query) or in resolveTargetHost so the chosen host's relay
readiness is enforced or surfaced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74959131-804e-4976-9328-6bcfc81e27ea

📥 Commits

Reviewing files that changed from the base of the PR and between 77d2354 and 293f260.

📒 Files selected for processing (11)
  • apps/api/src/app/api/automations/dispatch/[id]/route.ts
  • packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql
  • packages/db/drizzle/meta/0057_snapshot.json
  • packages/db/drizzle/meta/_journal.json
  • packages/db/src/schema/schema.ts
  • packages/host-service/src/serve.ts
  • packages/host-service/src/tunnel/connect.ts
  • packages/host-service/src/tunnel/tunnel-client.ts
  • packages/trpc/src/router/automation/automation.ts
  • packages/trpc/src/router/automation/dispatch.ts
  • packages/trpc/src/router/host/host.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql

Comment on lines +124 to +126
this.emitState(
this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep disconnected sticky after the recovery budget expires.

Once scheduleReconnect() crosses the give-up window and emits disconnected, the next connect() call immediately emits connecting_recovery again. That puts the cloud row back into a deferrable state on every retry, so dispatch can keep re-enqueueing past the 5-minute budget instead of surfacing the intended hard failure.

Proposed fix
+	private getTunnelStateForAttempt(): TunnelClientState {
+		if (
+			this.hasBeenOpen &&
+			this.recoveryStartedAt !== null &&
+			Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS
+		) {
+			return "disconnected";
+		}
+		return this.hasBeenOpen ? "connecting_recovery" : "connecting_initial";
+	}
+
 	async connect(): Promise<void> {
 		…
-		this.emitState(
-			this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
-		);
+		this.emitState(this.getTunnelStateForAttempt());
 		…
 	}
 
 	private scheduleReconnect(): void {
 		…
-		if (
-			this.hasBeenOpen &&
-			this.recoveryStartedAt !== null &&
-			Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS
-		) {
-			this.emitState("disconnected");
-		} else {
-			this.emitState(
-				this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
-			);
-		}
+		this.emitState(this.getTunnelStateForAttempt());
 		…
 	}

Also applies to: 468-486

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/host-service/src/tunnel/tunnel-client.ts` around lines 124 - 126,
When scheduleReconnect() decides the recovery budget has been exhausted and
emits "disconnected", set a sticky flag (e.g., this.recoveryGivenUp or
this.stickyDisconnected) on the TunnelClient; then update connect() where it
currently calls this.emitState(this.hasBeenOpen ? "connecting_recovery" :
"connecting_initial") to first check that flag and, if set, avoid emitting
"connecting_recovery" (leave state as "disconnected" or no-op); make the same
change for the other similar block around the code referenced at lines ~468-486
so subsequent connect attempts don't revert the hard failure state.

Comment thread packages/trpc/src/router/host/host.ts Outdated
Comment on lines +245 to +260
// Only the host's owner-side machine should write its tunnel state.
// The JWT scope ensures only the actual signed-in user on that host
// can report it.
const access = await db.query.v2UsersHosts.findFirst({
where: and(
eq(v2UsersHosts.userId, ctx.userId),
eq(v2UsersHosts.organizationId, parsed.organizationId),
eq(v2UsersHosts.hostId, parsed.machineId),
),
columns: { hostId: true },
});
if (!access) {
throw new TRPCError({
code: "FORBIDDEN",
message: "No access to this host",
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten who can write relay tunnel status.

This check allows any user with a v2_users_hosts row to post arbitrary status and reportedAt values for the host. Because dispatch decisions now hinge on that field, a shared member can block or defer another machine's automations by spoofing tunnel state. If this endpoint is meant to be host-originated, it needs a narrower authorization check—at minimum owner role.

Proposed fix
 			const access = await db.query.v2UsersHosts.findFirst({
 				where: and(
 					eq(v2UsersHosts.userId, ctx.userId),
 					eq(v2UsersHosts.organizationId, parsed.organizationId),
 					eq(v2UsersHosts.hostId, parsed.machineId),
+					eq(v2UsersHosts.role, "owner"),
 				),
 				columns: { hostId: true },
 			});
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Only the host's owner-side machine should write its tunnel state.
// The JWT scope ensures only the actual signed-in user on that host
// can report it.
const access = await db.query.v2UsersHosts.findFirst({
where: and(
eq(v2UsersHosts.userId, ctx.userId),
eq(v2UsersHosts.organizationId, parsed.organizationId),
eq(v2UsersHosts.hostId, parsed.machineId),
),
columns: { hostId: true },
});
if (!access) {
throw new TRPCError({
code: "FORBIDDEN",
message: "No access to this host",
});
// Only the host's owner-side machine should write its tunnel state.
// The JWT scope ensures only the actual signed-in user on that host
// can report it.
const access = await db.query.v2UsersHosts.findFirst({
where: and(
eq(v2UsersHosts.userId, ctx.userId),
eq(v2UsersHosts.organizationId, parsed.organizationId),
eq(v2UsersHosts.hostId, parsed.machineId),
eq(v2UsersHosts.role, "owner"),
),
columns: { hostId: true },
});
if (!access) {
throw new TRPCError({
code: "FORBIDDEN",
message: "No access to this host",
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/trpc/src/router/host/host.ts` around lines 245 - 260, The current
authorization (db.query.v2UsersHosts.findFirst) lets any linked user update
tunnel status; tighten it to require an owner-level association so only the host
owner can report tunnel state. Update the findFirst where clause (the
v2UsersHosts query used in this handler) to include a check for the owner
role/flag (e.g., eq(v2UsersHosts.role, 'owner') or the equivalent owner boolean
column) against ctx.userId, parsed.organizationId and parsed.machineId, and keep
throwing TRPCError(FORBIDDEN) when that stricter check fails; this ensures only
the host owner can write status/reportedAt.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 11 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/trpc/src/router/host/host.ts">

<violation number="1" location="packages/trpc/src/router/host/host.ts:230">
P1: Client-controlled `reportedAt` can freeze status updates under clock skew because stale-write protection uses `lt(stored, incoming)` with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future `relayStatusUpdatedAt`, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.</violation>
</file>

<file name="packages/host-service/src/tunnel/tunnel-client.ts">

<violation number="1" location="packages/host-service/src/tunnel/tunnel-client.ts:124">
P1: `connect()` unconditionally emits `connecting_recovery` when `hasBeenOpen` is true, even after `scheduleReconnect()` has already emitted `disconnected` (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for `recoveryStartedAt` exceeding `RECOVERY_GIVEUP_MS` and emit `disconnected` instead of `connecting_recovery` when the budget is already blown.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread packages/trpc/src/router/host/host.ts Outdated
z.object({
hostId: z.string().min(1),
status: z.enum(relayTunnelStatusValues),
reportedAt: z.coerce.date(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Client-controlled reportedAt can freeze status updates under clock skew because stale-write protection uses lt(stored, incoming) with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future relayStatusUpdatedAt, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/trpc/src/router/host/host.ts, line 230:

<comment>Client-controlled `reportedAt` can freeze status updates under clock skew because stale-write protection uses `lt(stored, incoming)` with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future `relayStatusUpdatedAt`, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.</comment>

<file context>
@@ -212,6 +213,77 @@ export const hostRouter = {
+			z.object({
+				hostId: z.string().min(1),
+				status: z.enum(relayTunnelStatusValues),
+				reportedAt: z.coerce.date(),
+			}),
+		)
</file context>
Suggested change
reportedAt: z.coerce.date(),
reportedAt: z.coerce.date().refine((d) => d <= new Date(Date.now() + 60_000), {
message: "reportedAt cannot be more than 60s in the future",
}),

Comment on lines +124 to +126
this.emitState(
this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: connect() unconditionally emits connecting_recovery when hasBeenOpen is true, even after scheduleReconnect() has already emitted disconnected (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for recoveryStartedAt exceeding RECOVERY_GIVEUP_MS and emit disconnected instead of connecting_recovery when the budget is already blown.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/host-service/src/tunnel/tunnel-client.ts, line 124:

<comment>`connect()` unconditionally emits `connecting_recovery` when `hasBeenOpen` is true, even after `scheduleReconnect()` has already emitted `disconnected` (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for `recoveryStartedAt` exceeding `RECOVERY_GIVEUP_MS` and emit `disconnected` instead of `connecting_recovery` when the budget is already blown.</comment>

<file context>
@@ -62,6 +117,13 @@ export class TunnelClient {
+		// rather than going straight from "unknown" to "connected" with no
+		// intermediate signal — dispatch uses this to decide whether to
+		// defer or fail.
+		this.emitState(
+			this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
+		);
</file context>
Suggested change
this.emitState(
this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
);
if (
this.hasBeenOpen &&
this.recoveryStartedAt !== null &&
Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS
) {
this.emitState("disconnected");
} else {
this.emitState(
this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
);
}

@justincrich justincrich marked this pull request as draft May 26, 2026 21:26
@justincrich justincrich force-pushed the improvement/SUPER-783-automation-new-workspace-target branch from 407d738 to f8e033c Compare May 26, 2026 22:55
@justincrich justincrich changed the title fix(automation): gate dispatch on relay tunnel readiness (SUPER-783) fix(automation): create workspace + launch agent in a single relay call (SUPER-783) May 26, 2026
…ll (SUPER-783)

"New workspace" automations (the default for scheduled agent runs) failed
silently: the workspace would never get created and the agent would never
start. The user-visible symptom was an opaque dispatch failure.

## Root cause

`dispatchAutomation` made two sequential relay round-trips for the
new-workspace path:

  1. `workspaces.create`  → returns workspace id
  2. `agents.run`         → spawns the agent against that id

Each round-trip independently races the tunnel state on the host side.
The first call would usually succeed (workspace gets created in cloud +
host), but the second call would frequently land in a moment where the
tunnel had just dropped — the relay returned 503 "Host not connected"
and the agent never spawned. The cloud wrote the run as `dispatch_failed`
even though a workspace had actually been created on the host.

## Fix

The host's `workspaces.create` handler already supports launching agents
inline via the `agents: []` field (see
`packages/host-service/.../workspaces.ts:1025` — `dispatchSugarAgents`).
The Superset SDK (`packages/sdk/src/resources/workspaces.ts:41`) already
uses this. The dispatch path was the only caller that split create and
launch into two trips.

Pass `agents: [{ agent, prompt }]` in the same `workspaces.create` call.
One relay round-trip instead of two, no second tunnel race, atomic on the
host side. The pinned-workspace path (`automation.v2WorkspaceId !== null`)
is unchanged — still a single `agents.run` call.
@justincrich justincrich force-pushed the improvement/SUPER-783-automation-new-workspace-target branch from f8e033c to a47df6a Compare May 26, 2026 22:59
@justincrich
Copy link
Copy Markdown
Contributor Author

Moving the WHY out of code comments per #4936 (comment).

Why the fix is shaped this way:

The "New workspace" path used to make two sequential relay round-trips:

  1. workspaces.create → returns the new workspace id
  2. agents.run → spawns the agent against that id

Each round-trip independently races the host's tunnel state. The first call would usually succeed (workspace gets created in cloud + host), but the second one would frequently land in a moment where the tunnel had just reconnected and the relay's in-memory routing didn't yet have the host — relay returned 503 "Host not connected" and the agent never spawned. The cloud wrote the run as dispatch_failed even though a workspace had actually been created on the host. That's SUPER-783.

The host's workspaces.create handler already supports launching agents inline via the agents: [] field — see packages/host-service/src/trpc/router/workspaces/workspaces.ts:1025 (dispatchSugarAgents). The Superset SDK (packages/sdk/src/resources/workspaces.ts:41) already uses this. The dispatch path was the only caller that split create and launch into two trips.

This PR passes agents: [{ agent, prompt }] in the same workspaces.create call. One relay round-trip instead of two, no second tunnel race, atomic on the host side. The pinned-workspace path (automation.v2WorkspaceId !== null) is unchanged — still a single agents.run call.

@justincrich
Copy link
Copy Markdown
Contributor Author

🚧 Blocked on staging verification before code review

The change here is small (41 lines, one file) but I cannot self-verify the end-to-end workspace-create flow in local dev. Mirroring the comment I left on SUPER-783needs someone with staging credentials to deploy + test before this is reviewable. I'm on a work trial without staging access.

Why local verification is blocked

I fired 10+ dispatch attempts against the local stack. Every one died before reaching host-service, for two independent reasons that are not caused by this PR:

1. Tunnel flap in this worktree's dev env. host-service ↔ relay WebSocket reconnects every ~1s with no close-code logged:

[host-service:tunnel] connected to relay
[host-service:tunnel] reconnecting in 837ms (attempt 1)
[host-service:tunnel] connected to relay
[host-service:tunnel] reconnecting in 905ms (attempt 1)

Every dispatch hit the relay during a brief disconnected window → relay 503 Host not connected. Host-service never saw the payload. There is no socket.close in my diff.

2. Empty host-service projects table. The host's local SQLite has zero rows in projects. The dispatch targets project a98d58f8… which isn't registered locally. Even with a rock-solid tunnel, requireLocalProject(input.projectId) at packages/host-service/src/trpc/router/workspaces/workspaces.ts:530 throws PROJECT_NOT_SETUP before any agent is spawned.

Why I'm still confident the fix is correct

The pattern this PR adopts is already running in production via the Superset SDK. Every customer calling superset workspaces create --agent claude goes through the inline-agents code path on workspaces.create. The cron dispatch was the only caller that split create + agent-launch into two relay round-trips — the second one was racing the tunnel and producing orphan workspaces. This PR brings dispatch in line with the SDK's working pattern.

  • SDK reference: packages/sdk/src/resources/workspaces.ts:41-58
  • Host handler that does the work: packages/host-service/.../workspaces.ts:1025 (dispatchSugarAgents)

What needs to happen on staging

  1. Deploy this branch (improvement/SUPER-783-automation-new-workspace-target, HEAD a47df6a80) to staging.
  2. Point a dev desktop at staging cloud + staging relay.
  3. Create a "New workspace" automation against that host with any prompt.
  4. Click Run now.
  5. Pass criteriaautomation_runs row lands status='dispatched' with v2_workspace_id set and session_kind set; the new workspace appears on the host; the agent session starts.
  6. Diagnostic signal — exactly one outgoing host.* relay call per dispatch (was two). Two = the change didn't deploy.

Static checks that already pass:

  • bun run typecheck (full monorepo)
  • bun run lint
  • grep confirms dispatch makes one relay call for new-workspace, never followed by agents.run
  • ✅ Payload shape matches host's agentLaunchSchema

Happy to pair with whoever does the staging run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant