fix(automation): create workspace + launch agent in a single relay call (SUPER-783) by justincrich · Pull Request #4936 · superset-sh/superset

justincrich · 2026-05-26T18:48:16Z

Summary

Closes SUPER-783. "New workspace" automations (the default for scheduled agent runs) failed silently — the workspace would never get created and the agent would never start. User saw an opaque dispatch failure with no indication why.

Root cause

dispatchAutomation made two sequential relay round-trips for the new-workspace path:

workspaces.create → returns the new workspace id
agents.run → spawns the agent against that id

Each round-trip independently races the host's tunnel state. The first call would usually succeed (workspace gets created in cloud + host), but the second call would frequently land in a moment where the tunnel had just reconnected and the relay's in-memory routing didn't yet have the host — relay returned 503 "Host not connected" and the agent never spawned. The cloud wrote the run as dispatch_failed even though a workspace had actually been created on the host.

Fix

The host's workspaces.create handler already supports launching agents inline via the agents: [] field — see packages/host-service/src/trpc/router/workspaces/workspaces.ts:1025 (dispatchSugarAgents). The Superset SDK (packages/sdk/src/resources/workspaces.ts:41) already uses this. The dispatch path was the only caller that split create and launch into two trips.

This PR passes agents: [{ agent, prompt }] in the same workspaces.create call. One relay round-trip instead of two, no second tunnel race, atomic on the host side.

The pinned-workspace path (automation.v2WorkspaceId !== null) is unchanged — still a single agents.run call.

Diff

packages/trpc/src/router/automation/dispatch.ts | 84 ++++++++++++++++++-------
1 file changed, 60 insertions(+), 24 deletions(-)

Test plan

Schedule an automation with New workspace target on a host with a healthy tunnel → workspace gets created AND agent session starts (one automation_runs row with status='dispatched', v2_workspace_id populated, session_kind set)
Run history shows the new workspace + the agent session as a normal run, not dispatch_failed
Pinned-workspace automation still works (regression: automation.v2WorkspaceId !== null path unchanged)
Host-side agent-launch errors (e.g. project not set up locally) propagate as dispatch_failed with the host's error message instead of an opaque relay error

capy-ai · 2026-05-26T18:48:28Z

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

stage-review · 2026-05-26T18:48:33Z

Ready to review this PR? Stage has broken it down into 2 individual chapters for you:

	Title
1	Update workspace creation helper to include agents
2	Consolidate automation dispatch into single relay call

_{Chapters generated by Stage for commit a47df6a on May 26, 2026 10:59pm UTC.}

coderabbitai · 2026-05-26T18:48:35Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 926ed0b6-1220-4202-ae01-438367938c11

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR implements relay tunnel status tracking from hosts to the cloud and gates automation dispatch on relay connectivity. Hosts report tunnel state changes, the cloud API records timestamp-ordered updates, and the dispatch API defers, fails, or proceeds with automations based on the target host's relay status.

Changes

Relay Tunnel Status Tracking and Dispatch Validation

Layer / File(s)	Summary
Relay status schema and types `packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql`, `packages/db/drizzle/meta/_journal.json`, `packages/db/src/schema/schema.ts`	Database migration adds `relay_tunnel_status` (default `unknown`) and `relay_status_updated_at` columns to `v2_hosts`. Schema exports `RelayTunnelStatus` type and `relayTunnelStatusValues` enum (unknown, disabled, connecting_initial, connecting_recovery, connected, disconnected).
Tunnel client state machine lifecycle `packages/host-service/src/tunnel/tunnel-client.ts`	`TunnelClient` introduces a `TunnelClientState` state machine that distinguishes initial connect, recovery reconnect, successful connection, and disconnected states. It tracks recovery duration and emits `disconnected` once elapsed time exceeds `RECOVERY_GIVEUP_MS`, with deduplication and error handling around state change callbacks.
Host relay status reporting `packages/host-service/src/serve.ts`, `packages/host-service/src/tunnel/connect.ts`	Host service reports relay status on startup (disabled when relay URL not set) and subscribes tunnel state changes via `onStateChange` callback. `connectRelay` computes the host routing key and wires the callback to publish state transitions to the cloud with best-effort error handling.
Cloud relay status mutation endpoint `packages/trpc/src/router/host/host.ts`	New `setRelayTunnelStatus` mutation accepts `hostId`, `status`, and `reportedAt`, validates routing key and caller access, and conditionally updates `v2Hosts.relayTunnelStatus` and `v2Hosts.relayStatusUpdatedAt` only when `reportedAt` is newer than the stored timestamp (or when stored is null).
Dispatch relay validation and deferral `apps/api/src/app/api/automations/dispatch/[id]/route.ts`, `packages/trpc/src/router/automation/automation.ts`, `packages/trpc/src/router/automation/dispatch.ts`	Dispatch route gates execution on target host `relayTunnelStatus`: `disabled` records a `dispatch_failed` run and returns `relay_disabled`; `connecting_initial`/`connecting_recovery` defers via QStash with an incremented `attempt` counter and per-attempt deduplication up to `MAX_DEFER_ATTEMPTS`, after which it records a timeout failure and returns `deferred_timeout`; other statuses proceed to normal dispatch. `automationRouter.runNow` also pre-checks and rejects disabled relay with `BAD_REQUEST`.

Sequence Diagram

sequenceDiagram
  participant Host as TunnelClient (host)
  participant CloudAPI as host.setRelayTunnelStatus
  participant DB as v2Hosts
  participant DispatchRoute
  participant QStash
  participant Dispatcher as dispatchAutomation

  Host->>CloudAPI: setRelayTunnelStatus(hostId, status, reportedAt)
  CloudAPI->>DB: conditional update relayTunnelStatus if reportedAt newer
  DispatchRoute->>DB: read relayTunnelStatus for targetHostId
  alt disabled
    DispatchRoute->>DB: insert dispatch_failed run
    DispatchRoute-->>DispatchRoute: return relay_disabled
  else connecting_*
    DispatchRoute->>QStash: publish deferred request (attempt+1, dedup)
  else
    DispatchRoute->>Dispatcher: dispatchAutomation(...)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

superset-sh/superset#4594: Overlaps on tunnel-client reconnection lifecycle adjustments; both modify reconnect behavior and state handling.
superset-sh/superset#4539: Also modifies tunnel-client connection lifecycle; related to connect-phase timing and cleanup logic.
superset-sh/superset#4734: Related changes in automations dispatch flow and resolveTargetHost / dispatchAutomation logic.

Poem

🐰 I hop and watch the tunnel hum,
tiny states that come and run—
a whisper to the cloud I send,
"Please wait, or fail, or try again."
Relay paths lead work back home.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title describes gating dispatch on relay tunnel readiness, but the raw_summary and PR description reveal the fix is more nuanced: it adds relay tunnel status tracking, conditional retry logic via QStash, and pre-checks—not a simple atomic gate. The title is partially related but oversimplifies the main change.
Description check	✅ Passed	The PR description is comprehensive, covering root cause, fix strategy, and test plan. It follows the template with Summary (Related Issues via SUPER-783), Root cause explanation, and detailed Test plan sections. All critical context is present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-05-26T18:50:57Z

Greptile Summary

This PR introduces v2_hosts.relay_tunnel_status as a first-class cloud signal to gate the automation dispatch worker on relay tunnel readiness, fixing silent failures where automations dispatched during cold boot or with relay disabled surface an opaque 502. The host-service emits state transitions (connecting_initial, connecting_recovery, connected, disconnected, disabled) via a new setRelayTunnelStatus tRPC mutation, and the dispatch route branches on those states to either proceed, defer via QStash (up to 10 × 30 s), or write an actionable error row.

Dispatch gate (route.ts): reads relay_tunnel_status before calling the relay for pinned-host automations and re-enqueues with a 30 s delay when the host is still connecting, or writes a clear settings-pointing error when relay is disabled.
TunnelClient state machine (tunnel-client.ts): tracks hasBeenOpen and recoveryStartedAt to distinguish initial vs. recovery connecting phases and escalates to disconnected after 5 min of unsuccessful reconnects.
runNow guard (automation.ts): pre-checks disabled status for pinned hosts and throws BAD_REQUEST before creating any run row, so the failure feels like inline validation rather than a spawned failure.

Confidence Score: 4/5

Safe to merge for the pinned-host path; the auto-routed path still has the original tunnel-readiness gap.

The tunnel-status gate works correctly when targetHostId is explicitly set. However, resolveTargetHost in dispatch.ts — used for all auto-routed automations (targetHostId = null) — selects relayTunnelStatus but never filters or branches on it, so a host with disabled or connecting_initial status can still be chosen and hit the relay, reproducing the exact opaque 502 this PR is meant to eliminate.

packages/trpc/src/router/automation/dispatch.ts — the auto-routing query needs to exclude hosts whose relay tunnel status would cause a dispatch failure.

Important Files Changed

Filename	Overview
apps/api/src/app/api/automations/dispatch/[id]/route.ts	Adds tunnel-status gate before relay calls for pinned-host automations; introduces QStash defer-and-retry loop (up to 10×30s). Logic is sound for the explicit-targetHostId path.
packages/trpc/src/router/automation/dispatch.ts	Selects `relayTunnelStatus` in the auto-routing query but never acts on it — disabled/connecting hosts can still be picked for auto-routed automations, leaving the same 502 race for that path.
packages/host-service/src/tunnel/tunnel-client.ts	New state machine with `hasBeenOpen`/`recoveryStartedAt` tracking. Duplicate emission suppression, recovery giveup at 5 min, and correct `disconnected` escalation all look solid.
packages/trpc/src/router/host/host.ts	New `setRelayTunnelStatus` mutation with correct stale-write rejection, org membership verification, and routing-key parsing.
packages/trpc/src/router/automation/automation.ts	Adds `disabled`-guard pre-check to `runNow`; correctly scoped to pinned hosts only; throws `BAD_REQUEST` before any run row is created.
packages/db/src/schema/schema.ts	Adds `relayTunnelStatus` (text with enum, default `'unknown'`) and `relayStatusUpdatedAt` columns to `v2_hosts`; matches the SQL migration.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[QStash fires dispatch] --> B[Verify signature + parse payload]
    B --> C{automation.targetHostId set?}
    C -- No --> D[dispatchAutomation\nauto-route via resolveTargetHost]
    C -- Yes --> E[SELECT relay_tunnel_status\nfrom v2_hosts]
    E --> F{relay_tunnel_status?}
    F -- disabled --> G[INSERT dispatch_failed row\nactionable settings error]
    F -- connecting_initial\nconnecting_recovery --> H{attempt >= 10?}
    H -- Yes --> I[INSERT dispatch_failed row\ntimeout error]
    H -- No --> J[qstash.publishJSON\ndelay=30s, attempt+1]
    F -- connected --> K[dispatchAutomation]
    F -- disconnected\nunknown --> K
    K --> L{isOnline?}
    L -- No --> M[skipped_offline row]
    L -- Yes --> N[createWorkspaceOnHost\nrunAgentOnHost]
    D --> L

Comments Outside Diff (1)

packages/trpc/src/router/automation/dispatch.ts, line 172-203 (link)

Auto-routing ignores relay tunnel status

resolveTargetHost selects relayTunnelStatus for the auto-routed path but never filters on it. The WHERE clause only checks isOnline = true, so a host with relayTunnelStatus = 'disabled' or 'connecting_initial' can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in route.ts only runs when automation.targetHostId is set, so auto-routed automations (targetHostId = null) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.

The fix should add neq(v2Hosts.relayTunnelStatus, 'disabled') to the WHERE clause at minimum, and ideally also filter out 'connecting_initial' / 'connecting_recovery' (or defer in dispatchAutomation for those states, analogous to what route.ts does for pinned hosts).

Prompt To Fix With AI

This is a comment left during a code review.
Path: packages/trpc/src/router/automation/dispatch.ts
Line: 172-203

Comment:
**Auto-routing ignores relay tunnel status**

`resolveTargetHost` selects `relayTunnelStatus` for the auto-routed path but never filters on it. The `WHERE` clause only checks `isOnline = true`, so a host with `relayTunnelStatus = 'disabled'` or `'connecting_initial'` can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in `route.ts` only runs when `automation.targetHostId` is set, so auto-routed automations (`targetHostId = null`) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.

The fix should add `neq(v2Hosts.relayTunnelStatus, 'disabled')` to the `WHERE` clause at minimum, and ideally also filter out `'connecting_initial'` / `'connecting_recovery'` (or defer in `dispatchAutomation` for those states, analogous to what `route.ts` does for pinned hosts).

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/trpc/src/router/automation/dispatch.ts:172-203
**Auto-routing ignores relay tunnel status**

`resolveTargetHost` selects `relayTunnelStatus` for the auto-routed path but never filters on it. The `WHERE` clause only checks `isOnline = true`, so a host with `relayTunnelStatus = 'disabled'` or `'connecting_initial'` can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in `route.ts` only runs when `automation.targetHostId` is set, so auto-routed automations (`targetHostId = null`) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.

The fix should add `neq(v2Hosts.relayTunnelStatus, 'disabled')` to the `WHERE` clause at minimum, and ideally also filter out `'connecting_initial'` / `'connecting_recovery'` (or defer in `dispatchAutomation` for those states, analogous to what `route.ts` does for pinned hosts).

_{Reviews (2): Last reviewed commit: "fix(automation): gate dispatch on relay ..." | Re-trigger Greptile}

greptile-apps · 2026-05-26T18:51:01Z

+export function describeError(err: unknown, phase: DispatchPhase): string {
+	if (err instanceof DispatchInvariantError) return `${phase}: ${err.message}`;
+	if (err instanceof RelayDispatchError) return `${phase}: ${err.message}`;


Redundant DispatchInvariantError branch in describeError

DispatchInvariantError extends Error, so the first branch is dead code relative to the third — both produce identical output. This doesn't cause a bug today, but a future contributor might add specialised handling to one branch without noticing the other, creating a silent inconsistency. Either add behaviour that actually differs (e.g. a distinct prefix or extra logging), or remove the branch and rely on the generic Error check.

Prompt To Fix With AI

This is a comment left during a code review. Path: packages/trpc/src/router/automation/dispatch-errors.ts Line: 22-24 Comment: **Redundant `DispatchInvariantError` branch in `describeError`** `DispatchInvariantError` extends `Error`, so the first branch is dead code relative to the third — both produce identical output. This doesn't cause a bug today, but a future contributor might add specialised handling to one branch without noticing the other, creating a silent inconsistency. Either add behaviour that actually differs (e.g. a distinct prefix or extra logging), or remove the branch and rely on the generic `Error` check. How can I resolve this? If you propose a fix, please make it concise.

cubic-dev-ai

No issues found across 4 files

_{Re-trigger cubic}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/trpc/src/router/automation/dispatch.ts (1)
172-200: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Auto-routed dispatches still bypass relay readiness gating.

resolveTargetHost() still returns any online host when automation.targetHostId is null; the newly selected relayTunnelStatus / relayStatusUpdatedAt fields are never consulted. That means auto-routed automations can still dispatch against disabled or connecting_* hosts and hit the same opaque relay failures this PR is trying to eliminate. Please either filter to relay-ready hosts here or surface the chosen host's relay status back to the route so it can apply the same disabled/defer behavior before calling the relay.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/trpc/src/router/automation/dispatch.ts` around lines 172 - 200, The
resolveTargetHost flow currently returns any online host and ignores relay
readiness; update the selection logic in resolveTargetHost (the query using
v2Hosts/v2UsersHosts and automation.targetHostId) to either 1) filter to only
relay-ready hosts by adding conditions on v2Hosts.relayTunnelStatus (exclude
'disabled' and 'connecting_*' states, accept 'ready') and optionally on
v2Hosts.relayStatusUpdatedAt (e.g., recent/valid), or 2) ensure the query still
returns relayTunnelStatus and relayStatusUpdatedAt alongside the host and
propagate those fields back to the caller so the route can apply the same
disabled/defer logic before invoking the relay; make the change in the block
that builds the select/from/where (the const [host] = await dbWs... query) or in
resolveTargetHost so the chosen host's relay readiness is enforced or surfaced.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/host-service/src/tunnel/tunnel-client.ts`:
- Around line 124-126: When scheduleReconnect() decides the recovery budget has
been exhausted and emits "disconnected", set a sticky flag (e.g.,
this.recoveryGivenUp or this.stickyDisconnected) on the TunnelClient; then
update connect() where it currently calls this.emitState(this.hasBeenOpen ?
"connecting_recovery" : "connecting_initial") to first check that flag and, if
set, avoid emitting "connecting_recovery" (leave state as "disconnected" or
no-op); make the same change for the other similar block around the code
referenced at lines ~468-486 so subsequent connect attempts don't revert the
hard failure state.

In `@packages/trpc/src/router/host/host.ts`:
- Around line 245-260: The current authorization
(db.query.v2UsersHosts.findFirst) lets any linked user update tunnel status;
tighten it to require an owner-level association so only the host owner can
report tunnel state. Update the findFirst where clause (the v2UsersHosts query
used in this handler) to include a check for the owner role/flag (e.g.,
eq(v2UsersHosts.role, 'owner') or the equivalent owner boolean column) against
ctx.userId, parsed.organizationId and parsed.machineId, and keep throwing
TRPCError(FORBIDDEN) when that stricter check fails; this ensures only the host
owner can write status/reportedAt.

---

Outside diff comments:
In `@packages/trpc/src/router/automation/dispatch.ts`:
- Around line 172-200: The resolveTargetHost flow currently returns any online
host and ignores relay readiness; update the selection logic in
resolveTargetHost (the query using v2Hosts/v2UsersHosts and
automation.targetHostId) to either 1) filter to only relay-ready hosts by adding
conditions on v2Hosts.relayTunnelStatus (exclude 'disabled' and 'connecting_*'
states, accept 'ready') and optionally on v2Hosts.relayStatusUpdatedAt (e.g.,
recent/valid), or 2) ensure the query still returns relayTunnelStatus and
relayStatusUpdatedAt alongside the host and propagate those fields back to the
caller so the route can apply the same disabled/defer logic before invoking the
relay; make the change in the block that builds the select/from/where (the const
[host] = await dbWs... query) or in resolveTargetHost so the chosen host's relay
readiness is enforced or surfaced.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74959131-804e-4976-9328-6bcfc81e27ea

📥 Commits

Reviewing files that changed from the base of the PR and between 77d2354 and 293f260.

📒 Files selected for processing (11)

apps/api/src/app/api/automations/dispatch/[id]/route.ts
packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql
packages/db/drizzle/meta/0057_snapshot.json
packages/db/drizzle/meta/_journal.json
packages/db/src/schema/schema.ts
packages/host-service/src/serve.ts
packages/host-service/src/tunnel/connect.ts
packages/host-service/src/tunnel/tunnel-client.ts
packages/trpc/src/router/automation/automation.ts
packages/trpc/src/router/automation/dispatch.ts
packages/trpc/src/router/host/host.ts

✅ Files skipped from review due to trivial changes (1)

packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql

coderabbitai · 2026-05-26T21:02:40Z

+		this.emitState(
+			this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
+		);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep disconnected sticky after the recovery budget expires.

Once scheduleReconnect() crosses the give-up window and emits disconnected, the next connect() call immediately emits connecting_recovery again. That puts the cloud row back into a deferrable state on every retry, so dispatch can keep re-enqueueing past the 5-minute budget instead of surfacing the intended hard failure.

Proposed fix

+ private getTunnelStateForAttempt(): TunnelClientState { + if ( + this.hasBeenOpen && + this.recoveryStartedAt !== null && + Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS + ) { + return "disconnected"; + } + return this.hasBeenOpen ? "connecting_recovery" : "connecting_initial"; + } + async connect(): Promise<void> { … - this.emitState( - this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", - ); + this.emitState(this.getTunnelStateForAttempt()); … } private scheduleReconnect(): void { … - if ( - this.hasBeenOpen && - this.recoveryStartedAt !== null && - Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS - ) { - this.emitState("disconnected"); - } else { - this.emitState( - this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", - ); - } + this.emitState(this.getTunnelStateForAttempt()); … }

Also applies to: 468-486

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/host-service/src/tunnel/tunnel-client.ts` around lines 124 - 126, When scheduleReconnect() decides the recovery budget has been exhausted and emits "disconnected", set a sticky flag (e.g., this.recoveryGivenUp or this.stickyDisconnected) on the TunnelClient; then update connect() where it currently calls this.emitState(this.hasBeenOpen ? "connecting_recovery" : "connecting_initial") to first check that flag and, if set, avoid emitting "connecting_recovery" (leave state as "disconnected" or no-op); make the same change for the other similar block around the code referenced at lines ~468-486 so subsequent connect attempts don't revert the hard failure state.

coderabbitai · 2026-05-26T21:02:40Z

+			// Only the host's owner-side machine should write its tunnel state.
+			// The JWT scope ensures only the actual signed-in user on that host
+			// can report it.
+			const access = await db.query.v2UsersHosts.findFirst({
+				where: and(
+					eq(v2UsersHosts.userId, ctx.userId),
+					eq(v2UsersHosts.organizationId, parsed.organizationId),
+					eq(v2UsersHosts.hostId, parsed.machineId),
+				),
+				columns: { hostId: true },
+			});
+			if (!access) {
+				throw new TRPCError({
+					code: "FORBIDDEN",
+					message: "No access to this host",
+				});


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten who can write relay tunnel status.

This check allows any user with a v2_users_hosts row to post arbitrary status and reportedAt values for the host. Because dispatch decisions now hinge on that field, a shared member can block or defer another machine's automations by spoofing tunnel state. If this endpoint is meant to be host-originated, it needs a narrower authorization check—at minimum owner role.

Proposed fix

const access = await db.query.v2UsersHosts.findFirst({ where: and( eq(v2UsersHosts.userId, ctx.userId), eq(v2UsersHosts.organizationId, parsed.organizationId), eq(v2UsersHosts.hostId, parsed.machineId), + eq(v2UsersHosts.role, "owner"), ), columns: { hostId: true }, });

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Only the host's owner-side machine should write its tunnel state.

// The JWT scope ensures only the actual signed-in user on that host

// can report it.

const access = await db.query.v2UsersHosts.findFirst({

where: and(

eq(v2UsersHosts.userId, ctx.userId),

eq(v2UsersHosts.organizationId, parsed.organizationId),

eq(v2UsersHosts.hostId, parsed.machineId),

),

columns: { hostId: true },

});

if (!access) {

throw new TRPCError({

code: "FORBIDDEN",

message: "No access to this host",

});

// Only the host's owner-side machine should write its tunnel state.

// The JWT scope ensures only the actual signed-in user on that host

// can report it.

const access = await db.query.v2UsersHosts.findFirst({

where: and(

eq(v2UsersHosts.userId, ctx.userId),

eq(v2UsersHosts.organizationId, parsed.organizationId),

eq(v2UsersHosts.hostId, parsed.machineId),

eq(v2UsersHosts.role, "owner"),

),

columns: { hostId: true },

});

if (!access) {

throw new TRPCError({

code: "FORBIDDEN",

message: "No access to this host",

});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/trpc/src/router/host/host.ts` around lines 245 - 260, The current authorization (db.query.v2UsersHosts.findFirst) lets any linked user update tunnel status; tighten it to require an owner-level association so only the host owner can report tunnel state. Update the findFirst where clause (the v2UsersHosts query used in this handler) to include a check for the owner role/flag (e.g., eq(v2UsersHosts.role, 'owner') or the equivalent owner boolean column) against ctx.userId, parsed.organizationId and parsed.machineId, and keep throwing TRPCError(FORBIDDEN) when that stricter check fails; this ensures only the host owner can write status/reportedAt.

cubic-dev-ai

2 issues found across 11 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/trpc/src/router/host/host.ts">

<violation number="1" location="packages/trpc/src/router/host/host.ts:230">
P1: Client-controlled `reportedAt` can freeze status updates under clock skew because stale-write protection uses `lt(stored, incoming)` with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future `relayStatusUpdatedAt`, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.</violation>
</file>

<file name="packages/host-service/src/tunnel/tunnel-client.ts">

<violation number="1" location="packages/host-service/src/tunnel/tunnel-client.ts:124">
P1: `connect()` unconditionally emits `connecting_recovery` when `hasBeenOpen` is true, even after `scheduleReconnect()` has already emitted `disconnected` (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for `recoveryStartedAt` exceeding `RECOVERY_GIVEUP_MS` and emit `disconnected` instead of `connecting_recovery` when the budget is already blown.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-05-26T21:05:49Z

+			z.object({
+				hostId: z.string().min(1),
+				status: z.enum(relayTunnelStatusValues),
+				reportedAt: z.coerce.date(),


P1: Client-controlled reportedAt can freeze status updates under clock skew because stale-write protection uses lt(stored, incoming) with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future relayStatusUpdatedAt, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At packages/trpc/src/router/host/host.ts, line 230: <comment>Client-controlled `reportedAt` can freeze status updates under clock skew because stale-write protection uses `lt(stored, incoming)` with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future `relayStatusUpdatedAt`, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.</comment> <file context> @@ -212,6 +213,77 @@ export const hostRouter = { + z.object({ + hostId: z.string().min(1), + status: z.enum(relayTunnelStatusValues), + reportedAt: z.coerce.date(), + }), + ) </file context>

Suggested change

reportedAt: z.coerce.date(),

reportedAt: z.coerce.date().refine((d) => d <= new Date(Date.now() + 60_000), {

message: "reportedAt cannot be more than 60s in the future",

}),

cubic-dev-ai · 2026-05-26T21:05:50Z

+		this.emitState(
+			this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
+		);


P1: connect() unconditionally emits connecting_recovery when hasBeenOpen is true, even after scheduleReconnect() has already emitted disconnected (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for recoveryStartedAt exceeding RECOVERY_GIVEUP_MS and emit disconnected instead of connecting_recovery when the budget is already blown.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At packages/host-service/src/tunnel/tunnel-client.ts, line 124: <comment>`connect()` unconditionally emits `connecting_recovery` when `hasBeenOpen` is true, even after `scheduleReconnect()` has already emitted `disconnected` (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for `recoveryStartedAt` exceeding `RECOVERY_GIVEUP_MS` and emit `disconnected` instead of `connecting_recovery` when the budget is already blown.</comment> <file context> @@ -62,6 +117,13 @@ export class TunnelClient { + // rather than going straight from "unknown" to "connected" with no + // intermediate signal — dispatch uses this to decide whether to + // defer or fail. + this.emitState( + this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", + ); </file context>

Suggested change

this.emitState(

this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",

);

if (

this.hasBeenOpen &&

this.recoveryStartedAt !== null &&

Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS

) {

this.emitState("disconnected");

} else {

this.emitState(

this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",

);

}

…ll (SUPER-783) "New workspace" automations (the default for scheduled agent runs) failed silently: the workspace would never get created and the agent would never start. The user-visible symptom was an opaque dispatch failure. ## Root cause `dispatchAutomation` made two sequential relay round-trips for the new-workspace path: 1. `workspaces.create` → returns workspace id 2. `agents.run` → spawns the agent against that id Each round-trip independently races the tunnel state on the host side. The first call would usually succeed (workspace gets created in cloud + host), but the second call would frequently land in a moment where the tunnel had just dropped — the relay returned 503 "Host not connected" and the agent never spawned. The cloud wrote the run as `dispatch_failed` even though a workspace had actually been created on the host. ## Fix The host's `workspaces.create` handler already supports launching agents inline via the `agents: []` field (see `packages/host-service/.../workspaces.ts:1025` — `dispatchSugarAgents`). The Superset SDK (`packages/sdk/src/resources/workspaces.ts:41`) already uses this. The dispatch path was the only caller that split create and launch into two trips. Pass `agents: [{ agent, prompt }]` in the same `workspaces.create` call. One relay round-trip instead of two, no second tunnel race, atomic on the host side. The pinned-workspace path (`automation.v2WorkspaceId !== null`) is unchanged — still a single `agents.run` call.

justincrich · 2026-05-26T22:59:57Z

Moving the WHY out of code comments per #4936 (comment).

Why the fix is shaped this way:

The "New workspace" path used to make two sequential relay round-trips:

workspaces.create → returns the new workspace id
agents.run → spawns the agent against that id

Each round-trip independently races the host's tunnel state. The first call would usually succeed (workspace gets created in cloud + host), but the second one would frequently land in a moment where the tunnel had just reconnected and the relay's in-memory routing didn't yet have the host — relay returned 503 "Host not connected" and the agent never spawned. The cloud wrote the run as dispatch_failed even though a workspace had actually been created on the host. That's SUPER-783.

The host's workspaces.create handler already supports launching agents inline via the agents: [] field — see packages/host-service/src/trpc/router/workspaces/workspaces.ts:1025 (dispatchSugarAgents). The Superset SDK (packages/sdk/src/resources/workspaces.ts:41) already uses this. The dispatch path was the only caller that split create and launch into two trips.

This PR passes agents: [{ agent, prompt }] in the same workspaces.create call. One relay round-trip instead of two, no second tunnel race, atomic on the host side. The pinned-workspace path (automation.v2WorkspaceId !== null) is unchanged — still a single agents.run call.

justincrich · 2026-05-26T23:08:42Z

🚧 Blocked on staging verification before code review

The change here is small (41 lines, one file) but I cannot self-verify the end-to-end workspace-create flow in local dev. Mirroring the comment I left on SUPER-783 — needs someone with staging credentials to deploy + test before this is reviewable. I'm on a work trial without staging access.

Why local verification is blocked

I fired 10+ dispatch attempts against the local stack. Every one died before reaching host-service, for two independent reasons that are not caused by this PR:

1. Tunnel flap in this worktree's dev env. host-service ↔ relay WebSocket reconnects every ~1s with no close-code logged:

[host-service:tunnel] connected to relay
[host-service:tunnel] reconnecting in 837ms (attempt 1)
[host-service:tunnel] connected to relay
[host-service:tunnel] reconnecting in 905ms (attempt 1)

Every dispatch hit the relay during a brief disconnected window → relay 503 Host not connected. Host-service never saw the payload. There is no socket.close in my diff.

2. Empty host-service projects table. The host's local SQLite has zero rows in projects. The dispatch targets project a98d58f8… which isn't registered locally. Even with a rock-solid tunnel, requireLocalProject(input.projectId) at packages/host-service/src/trpc/router/workspaces/workspaces.ts:530 throws PROJECT_NOT_SETUP before any agent is spawned.

Why I'm still confident the fix is correct

The pattern this PR adopts is already running in production via the Superset SDK. Every customer calling superset workspaces create --agent claude goes through the inline-agents code path on workspaces.create. The cron dispatch was the only caller that split create + agent-launch into two relay round-trips — the second one was racing the tunnel and producing orphan workspaces. This PR brings dispatch in line with the SDK's working pattern.

SDK reference: packages/sdk/src/resources/workspaces.ts:41-58
Host handler that does the work: packages/host-service/.../workspaces.ts:1025 (dispatchSugarAgents)

What needs to happen on staging

Deploy this branch (improvement/SUPER-783-automation-new-workspace-target, HEAD a47df6a80) to staging.
Point a dev desktop at staging cloud + staging relay.
Create a "New workspace" automation against that host with any prompt.
Click Run now.
Pass criteria — automation_runs row lands status='dispatched' with v2_workspace_id set and session_kind set; the new workspace appears on the host; the agent session starts.
Diagnostic signal — exactly one outgoing host.* relay call per dispatch (was two). Two = the change didn't deploy.

Static checks that already pass:

✅ bun run typecheck (full monorepo)
✅ bun run lint
✅ grep confirms dispatch makes one relay call for new-workspace, never followed by agents.run
✅ Payload shape matches host's agentLaunchSchema

Happy to pair with whoever does the staging run.

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 26, 2026

View reviewed changes

justincrich marked this pull request as draft May 26, 2026 19:07

justincrich force-pushed the improvement/SUPER-783-automation-new-workspace-target branch from 128d590 to 293f260 Compare May 26, 2026 20:49

justincrich changed the title ~~fix(automation): distinguish workspace-create vs agent-run dispatch errors (SUPER-783)~~ fix(automation): gate dispatch on relay tunnel readiness (SUPER-783) May 26, 2026

justincrich marked this pull request as ready for review May 26, 2026 20:54

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 26, 2026

View reviewed changes

justincrich marked this pull request as draft May 26, 2026 21:26

justincrich force-pushed the improvement/SUPER-783-automation-new-workspace-target branch from 407d738 to f8e033c Compare May 26, 2026 22:55

justincrich changed the title ~~fix(automation): gate dispatch on relay tunnel readiness (SUPER-783)~~ fix(automation): create workspace + launch agent in a single relay call (SUPER-783) May 26, 2026

justincrich force-pushed the improvement/SUPER-783-automation-new-workspace-target branch from f8e033c to a47df6a Compare May 26, 2026 22:59

-				reportedAt: z.coerce.date(),
+				reportedAt: z.coerce.date().refine((d) => d <= new Date(Date.now() + 60_000), {
+					message: "reportedAt cannot be more than 60s in the future",
+				}),

-		this.emitState(
-			this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
-		);
+		if (
+			this.hasBeenOpen &&
+			this.recoveryStartedAt !== null &&
+			Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS
+		) {
+			this.emitState("disconnected");
+		} else {
+			this.emitState(
+				this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
+			);
+		}

Conversation

justincrich commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Diff

Test plan

Uh oh!

capy-ai Bot commented May 26, 2026

Uh oh!

stage-review Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

justincrich commented May 26, 2026

Uh oh!

justincrich commented May 26, 2026

🚧 Blocked on staging verification before code review

Why local verification is blocked

Why I'm still confident the fix is correct

What needs to happen on staging

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

justincrich commented May 26, 2026 •

edited

Loading

stage-review Bot commented May 26, 2026 •

edited

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading

greptile-apps Bot commented May 26, 2026 •

edited

Loading