fix(automation): create workspace + launch agent in a single relay call (SUPER-783)#4936
Conversation
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
|
Ready to review this PR? Stage has broken it down into 2 individual chapters for you:
Chapters generated by Stage for commit a47df6a on May 26, 2026 10:59pm UTC. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR implements relay tunnel status tracking from hosts to the cloud and gates automation dispatch on relay connectivity. Hosts report tunnel state changes, the cloud API records timestamp-ordered updates, and the dispatch API defers, fails, or proceeds with automations based on the target host's relay status. ChangesRelay Tunnel Status Tracking and Dispatch Validation
Sequence DiagramsequenceDiagram
participant Host as TunnelClient (host)
participant CloudAPI as host.setRelayTunnelStatus
participant DB as v2Hosts
participant DispatchRoute
participant QStash
participant Dispatcher as dispatchAutomation
Host->>CloudAPI: setRelayTunnelStatus(hostId, status, reportedAt)
CloudAPI->>DB: conditional update relayTunnelStatus if reportedAt newer
DispatchRoute->>DB: read relayTunnelStatus for targetHostId
alt disabled
DispatchRoute->>DB: insert dispatch_failed run
DispatchRoute-->>DispatchRoute: return relay_disabled
else connecting_*
DispatchRoute->>QStash: publish deferred request (attempt+1, dedup)
else
DispatchRoute->>Dispatcher: dispatchAutomation(...)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR introduces
Confidence Score: 4/5Safe to merge for the pinned-host path; the auto-routed path still has the original tunnel-readiness gap. The tunnel-status gate works correctly when packages/trpc/src/router/automation/dispatch.ts — the auto-routing query needs to exclude hosts whose relay tunnel status would cause a dispatch failure.
|
| Filename | Overview |
|---|---|
| apps/api/src/app/api/automations/dispatch/[id]/route.ts | Adds tunnel-status gate before relay calls for pinned-host automations; introduces QStash defer-and-retry loop (up to 10×30s). Logic is sound for the explicit-targetHostId path. |
| packages/trpc/src/router/automation/dispatch.ts | Selects relayTunnelStatus in the auto-routing query but never acts on it — disabled/connecting hosts can still be picked for auto-routed automations, leaving the same 502 race for that path. |
| packages/host-service/src/tunnel/tunnel-client.ts | New state machine with hasBeenOpen/recoveryStartedAt tracking. Duplicate emission suppression, recovery giveup at 5 min, and correct disconnected escalation all look solid. |
| packages/trpc/src/router/host/host.ts | New setRelayTunnelStatus mutation with correct stale-write rejection, org membership verification, and routing-key parsing. |
| packages/trpc/src/router/automation/automation.ts | Adds disabled-guard pre-check to runNow; correctly scoped to pinned hosts only; throws BAD_REQUEST before any run row is created. |
| packages/db/src/schema/schema.ts | Adds relayTunnelStatus (text with enum, default 'unknown') and relayStatusUpdatedAt columns to v2_hosts; matches the SQL migration. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[QStash fires dispatch] --> B[Verify signature + parse payload]
B --> C{automation.targetHostId set?}
C -- No --> D[dispatchAutomation\nauto-route via resolveTargetHost]
C -- Yes --> E[SELECT relay_tunnel_status\nfrom v2_hosts]
E --> F{relay_tunnel_status?}
F -- disabled --> G[INSERT dispatch_failed row\nactionable settings error]
F -- connecting_initial\nconnecting_recovery --> H{attempt >= 10?}
H -- Yes --> I[INSERT dispatch_failed row\ntimeout error]
H -- No --> J[qstash.publishJSON\ndelay=30s, attempt+1]
F -- connected --> K[dispatchAutomation]
F -- disconnected\nunknown --> K
K --> L{isOnline?}
L -- No --> M[skipped_offline row]
L -- Yes --> N[createWorkspaceOnHost\nrunAgentOnHost]
D --> L
Comments Outside Diff (1)
-
packages/trpc/src/router/automation/dispatch.ts, line 172-203 (link)Auto-routing ignores relay tunnel status
resolveTargetHostselectsrelayTunnelStatusfor the auto-routed path but never filters on it. TheWHEREclause only checksisOnline = true, so a host withrelayTunnelStatus = 'disabled'or'connecting_initial'can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate inroute.tsonly runs whenautomation.targetHostIdis set, so auto-routed automations (targetHostId = null) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.The fix should add
neq(v2Hosts.relayTunnelStatus, 'disabled')to theWHEREclause at minimum, and ideally also filter out'connecting_initial'/'connecting_recovery'(or defer indispatchAutomationfor those states, analogous to whatroute.tsdoes for pinned hosts).Prompt To Fix With AI
This is a comment left during a code review. Path: packages/trpc/src/router/automation/dispatch.ts Line: 172-203 Comment: **Auto-routing ignores relay tunnel status** `resolveTargetHost` selects `relayTunnelStatus` for the auto-routed path but never filters on it. The `WHERE` clause only checks `isOnline = true`, so a host with `relayTunnelStatus = 'disabled'` or `'connecting_initial'` can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in `route.ts` only runs when `automation.targetHostId` is set, so auto-routed automations (`targetHostId = null`) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts. The fix should add `neq(v2Hosts.relayTunnelStatus, 'disabled')` to the `WHERE` clause at minimum, and ideally also filter out `'connecting_initial'` / `'connecting_recovery'` (or defer in `dispatchAutomation` for those states, analogous to what `route.ts` does for pinned hosts). How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
packages/trpc/src/router/automation/dispatch.ts:172-203
**Auto-routing ignores relay tunnel status**
`resolveTargetHost` selects `relayTunnelStatus` for the auto-routed path but never filters on it. The `WHERE` clause only checks `isOnline = true`, so a host with `relayTunnelStatus = 'disabled'` or `'connecting_initial'` can be chosen — the exact opaque 502 scenario this PR is supposed to fix. The new gate in `route.ts` only runs when `automation.targetHostId` is set, so auto-routed automations (`targetHostId = null`) receive zero protection and still race the relay handshake or hit a hard error on disabled hosts.
The fix should add `neq(v2Hosts.relayTunnelStatus, 'disabled')` to the `WHERE` clause at minimum, and ideally also filter out `'connecting_initial'` / `'connecting_recovery'` (or defer in `dispatchAutomation` for those states, analogous to what `route.ts` does for pinned hosts).
Reviews (2): Last reviewed commit: "fix(automation): gate dispatch on relay ..." | Re-trigger Greptile
| export function describeError(err: unknown, phase: DispatchPhase): string { | ||
| if (err instanceof DispatchInvariantError) return `${phase}: ${err.message}`; | ||
| if (err instanceof RelayDispatchError) return `${phase}: ${err.message}`; |
There was a problem hiding this comment.
Redundant
DispatchInvariantError branch in describeError
DispatchInvariantError extends Error, so the first branch is dead code relative to the third — both produce identical output. This doesn't cause a bug today, but a future contributor might add specialised handling to one branch without noticing the other, creating a silent inconsistency. Either add behaviour that actually differs (e.g. a distinct prefix or extra logging), or remove the branch and rely on the generic Error check.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/trpc/src/router/automation/dispatch-errors.ts
Line: 22-24
Comment:
**Redundant `DispatchInvariantError` branch in `describeError`**
`DispatchInvariantError` extends `Error`, so the first branch is dead code relative to the third — both produce identical output. This doesn't cause a bug today, but a future contributor might add specialised handling to one branch without noticing the other, creating a silent inconsistency. Either add behaviour that actually differs (e.g. a distinct prefix or extra logging), or remove the branch and rely on the generic `Error` check.
How can I resolve this? If you propose a fix, please make it concise.128d590 to
293f260
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/trpc/src/router/automation/dispatch.ts (1)
172-200:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftAuto-routed dispatches still bypass relay readiness gating.
resolveTargetHost()still returns any online host whenautomation.targetHostIdis null; the newly selectedrelayTunnelStatus/relayStatusUpdatedAtfields are never consulted. That means auto-routed automations can still dispatch againstdisabledorconnecting_*hosts and hit the same opaque relay failures this PR is trying to eliminate. Please either filter to relay-ready hosts here or surface the chosen host's relay status back to the route so it can apply the same disabled/defer behavior before calling the relay.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/trpc/src/router/automation/dispatch.ts` around lines 172 - 200, The resolveTargetHost flow currently returns any online host and ignores relay readiness; update the selection logic in resolveTargetHost (the query using v2Hosts/v2UsersHosts and automation.targetHostId) to either 1) filter to only relay-ready hosts by adding conditions on v2Hosts.relayTunnelStatus (exclude 'disabled' and 'connecting_*' states, accept 'ready') and optionally on v2Hosts.relayStatusUpdatedAt (e.g., recent/valid), or 2) ensure the query still returns relayTunnelStatus and relayStatusUpdatedAt alongside the host and propagate those fields back to the caller so the route can apply the same disabled/defer logic before invoking the relay; make the change in the block that builds the select/from/where (the const [host] = await dbWs... query) or in resolveTargetHost so the chosen host's relay readiness is enforced or surfaced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/host-service/src/tunnel/tunnel-client.ts`:
- Around line 124-126: When scheduleReconnect() decides the recovery budget has
been exhausted and emits "disconnected", set a sticky flag (e.g.,
this.recoveryGivenUp or this.stickyDisconnected) on the TunnelClient; then
update connect() where it currently calls this.emitState(this.hasBeenOpen ?
"connecting_recovery" : "connecting_initial") to first check that flag and, if
set, avoid emitting "connecting_recovery" (leave state as "disconnected" or
no-op); make the same change for the other similar block around the code
referenced at lines ~468-486 so subsequent connect attempts don't revert the
hard failure state.
In `@packages/trpc/src/router/host/host.ts`:
- Around line 245-260: The current authorization
(db.query.v2UsersHosts.findFirst) lets any linked user update tunnel status;
tighten it to require an owner-level association so only the host owner can
report tunnel state. Update the findFirst where clause (the v2UsersHosts query
used in this handler) to include a check for the owner role/flag (e.g.,
eq(v2UsersHosts.role, 'owner') or the equivalent owner boolean column) against
ctx.userId, parsed.organizationId and parsed.machineId, and keep throwing
TRPCError(FORBIDDEN) when that stricter check fails; this ensures only the host
owner can write status/reportedAt.
---
Outside diff comments:
In `@packages/trpc/src/router/automation/dispatch.ts`:
- Around line 172-200: The resolveTargetHost flow currently returns any online
host and ignores relay readiness; update the selection logic in
resolveTargetHost (the query using v2Hosts/v2UsersHosts and
automation.targetHostId) to either 1) filter to only relay-ready hosts by adding
conditions on v2Hosts.relayTunnelStatus (exclude 'disabled' and 'connecting_*'
states, accept 'ready') and optionally on v2Hosts.relayStatusUpdatedAt (e.g.,
recent/valid), or 2) ensure the query still returns relayTunnelStatus and
relayStatusUpdatedAt alongside the host and propagate those fields back to the
caller so the route can apply the same disabled/defer logic before invoking the
relay; make the change in the block that builds the select/from/where (the const
[host] = await dbWs... query) or in resolveTargetHost so the chosen host's relay
readiness is enforced or surfaced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 74959131-804e-4976-9328-6bcfc81e27ea
📒 Files selected for processing (11)
apps/api/src/app/api/automations/dispatch/[id]/route.tspackages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sqlpackages/db/drizzle/meta/0057_snapshot.jsonpackages/db/drizzle/meta/_journal.jsonpackages/db/src/schema/schema.tspackages/host-service/src/serve.tspackages/host-service/src/tunnel/connect.tspackages/host-service/src/tunnel/tunnel-client.tspackages/trpc/src/router/automation/automation.tspackages/trpc/src/router/automation/dispatch.tspackages/trpc/src/router/host/host.ts
✅ Files skipped from review due to trivial changes (1)
- packages/db/drizzle/0057_add_v2_hosts_relay_tunnel_status.sql
| this.emitState( | ||
| this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", | ||
| ); |
There was a problem hiding this comment.
Keep disconnected sticky after the recovery budget expires.
Once scheduleReconnect() crosses the give-up window and emits disconnected, the next connect() call immediately emits connecting_recovery again. That puts the cloud row back into a deferrable state on every retry, so dispatch can keep re-enqueueing past the 5-minute budget instead of surfacing the intended hard failure.
Proposed fix
+ private getTunnelStateForAttempt(): TunnelClientState {
+ if (
+ this.hasBeenOpen &&
+ this.recoveryStartedAt !== null &&
+ Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS
+ ) {
+ return "disconnected";
+ }
+ return this.hasBeenOpen ? "connecting_recovery" : "connecting_initial";
+ }
+
async connect(): Promise<void> {
…
- this.emitState(
- this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
- );
+ this.emitState(this.getTunnelStateForAttempt());
…
}
private scheduleReconnect(): void {
…
- if (
- this.hasBeenOpen &&
- this.recoveryStartedAt !== null &&
- Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS
- ) {
- this.emitState("disconnected");
- } else {
- this.emitState(
- this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
- );
- }
+ this.emitState(this.getTunnelStateForAttempt());
…
}Also applies to: 468-486
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/host-service/src/tunnel/tunnel-client.ts` around lines 124 - 126,
When scheduleReconnect() decides the recovery budget has been exhausted and
emits "disconnected", set a sticky flag (e.g., this.recoveryGivenUp or
this.stickyDisconnected) on the TunnelClient; then update connect() where it
currently calls this.emitState(this.hasBeenOpen ? "connecting_recovery" :
"connecting_initial") to first check that flag and, if set, avoid emitting
"connecting_recovery" (leave state as "disconnected" or no-op); make the same
change for the other similar block around the code referenced at lines ~468-486
so subsequent connect attempts don't revert the hard failure state.
| // Only the host's owner-side machine should write its tunnel state. | ||
| // The JWT scope ensures only the actual signed-in user on that host | ||
| // can report it. | ||
| const access = await db.query.v2UsersHosts.findFirst({ | ||
| where: and( | ||
| eq(v2UsersHosts.userId, ctx.userId), | ||
| eq(v2UsersHosts.organizationId, parsed.organizationId), | ||
| eq(v2UsersHosts.hostId, parsed.machineId), | ||
| ), | ||
| columns: { hostId: true }, | ||
| }); | ||
| if (!access) { | ||
| throw new TRPCError({ | ||
| code: "FORBIDDEN", | ||
| message: "No access to this host", | ||
| }); |
There was a problem hiding this comment.
Tighten who can write relay tunnel status.
This check allows any user with a v2_users_hosts row to post arbitrary status and reportedAt values for the host. Because dispatch decisions now hinge on that field, a shared member can block or defer another machine's automations by spoofing tunnel state. If this endpoint is meant to be host-originated, it needs a narrower authorization check—at minimum owner role.
Proposed fix
const access = await db.query.v2UsersHosts.findFirst({
where: and(
eq(v2UsersHosts.userId, ctx.userId),
eq(v2UsersHosts.organizationId, parsed.organizationId),
eq(v2UsersHosts.hostId, parsed.machineId),
+ eq(v2UsersHosts.role, "owner"),
),
columns: { hostId: true },
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Only the host's owner-side machine should write its tunnel state. | |
| // The JWT scope ensures only the actual signed-in user on that host | |
| // can report it. | |
| const access = await db.query.v2UsersHosts.findFirst({ | |
| where: and( | |
| eq(v2UsersHosts.userId, ctx.userId), | |
| eq(v2UsersHosts.organizationId, parsed.organizationId), | |
| eq(v2UsersHosts.hostId, parsed.machineId), | |
| ), | |
| columns: { hostId: true }, | |
| }); | |
| if (!access) { | |
| throw new TRPCError({ | |
| code: "FORBIDDEN", | |
| message: "No access to this host", | |
| }); | |
| // Only the host's owner-side machine should write its tunnel state. | |
| // The JWT scope ensures only the actual signed-in user on that host | |
| // can report it. | |
| const access = await db.query.v2UsersHosts.findFirst({ | |
| where: and( | |
| eq(v2UsersHosts.userId, ctx.userId), | |
| eq(v2UsersHosts.organizationId, parsed.organizationId), | |
| eq(v2UsersHosts.hostId, parsed.machineId), | |
| eq(v2UsersHosts.role, "owner"), | |
| ), | |
| columns: { hostId: true }, | |
| }); | |
| if (!access) { | |
| throw new TRPCError({ | |
| code: "FORBIDDEN", | |
| message: "No access to this host", | |
| }); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/trpc/src/router/host/host.ts` around lines 245 - 260, The current
authorization (db.query.v2UsersHosts.findFirst) lets any linked user update
tunnel status; tighten it to require an owner-level association so only the host
owner can report tunnel state. Update the findFirst where clause (the
v2UsersHosts query used in this handler) to include a check for the owner
role/flag (e.g., eq(v2UsersHosts.role, 'owner') or the equivalent owner boolean
column) against ctx.userId, parsed.organizationId and parsed.machineId, and keep
throwing TRPCError(FORBIDDEN) when that stricter check fails; this ensures only
the host owner can write status/reportedAt.
There was a problem hiding this comment.
2 issues found across 11 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/trpc/src/router/host/host.ts">
<violation number="1" location="packages/trpc/src/router/host/host.ts:230">
P1: Client-controlled `reportedAt` can freeze status updates under clock skew because stale-write protection uses `lt(stored, incoming)` with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future `relayStatusUpdatedAt`, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.</violation>
</file>
<file name="packages/host-service/src/tunnel/tunnel-client.ts">
<violation number="1" location="packages/host-service/src/tunnel/tunnel-client.ts:124">
P1: `connect()` unconditionally emits `connecting_recovery` when `hasBeenOpen` is true, even after `scheduleReconnect()` has already emitted `disconnected` (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for `recoveryStartedAt` exceeding `RECOVERY_GIVEUP_MS` and emit `disconnected` instead of `connecting_recovery` when the budget is already blown.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| z.object({ | ||
| hostId: z.string().min(1), | ||
| status: z.enum(relayTunnelStatusValues), | ||
| reportedAt: z.coerce.date(), |
There was a problem hiding this comment.
P1: Client-controlled reportedAt can freeze status updates under clock skew because stale-write protection uses lt(stored, incoming) with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future relayStatusUpdatedAt, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/trpc/src/router/host/host.ts, line 230:
<comment>Client-controlled `reportedAt` can freeze status updates under clock skew because stale-write protection uses `lt(stored, incoming)` with no server-side sanity check on the timestamp. A forward-skewed host clock writes a future `relayStatusUpdatedAt`, permanently rejecting all later legitimate updates and leaving dispatch decisions based on stale tunnel state.</comment>
<file context>
@@ -212,6 +213,77 @@ export const hostRouter = {
+ z.object({
+ hostId: z.string().min(1),
+ status: z.enum(relayTunnelStatusValues),
+ reportedAt: z.coerce.date(),
+ }),
+ )
</file context>
| reportedAt: z.coerce.date(), | |
| reportedAt: z.coerce.date().refine((d) => d <= new Date(Date.now() + 60_000), { | |
| message: "reportedAt cannot be more than 60s in the future", | |
| }), |
| this.emitState( | ||
| this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", | ||
| ); |
There was a problem hiding this comment.
P1: connect() unconditionally emits connecting_recovery when hasBeenOpen is true, even after scheduleReconnect() has already emitted disconnected (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for recoveryStartedAt exceeding RECOVERY_GIVEUP_MS and emit disconnected instead of connecting_recovery when the budget is already blown.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/host-service/src/tunnel/tunnel-client.ts, line 124:
<comment>`connect()` unconditionally emits `connecting_recovery` when `hasBeenOpen` is true, even after `scheduleReconnect()` has already emitted `disconnected` (recovery budget exceeded). The next reconnect attempt overrides the cloud status back to a deferrable state, allowing the dispatch worker to keep re-enqueueing indefinitely past the 5-minute budget. Add a check here for `recoveryStartedAt` exceeding `RECOVERY_GIVEUP_MS` and emit `disconnected` instead of `connecting_recovery` when the budget is already blown.</comment>
<file context>
@@ -62,6 +117,13 @@ export class TunnelClient {
+ // rather than going straight from "unknown" to "connected" with no
+ // intermediate signal — dispatch uses this to decide whether to
+ // defer or fail.
+ this.emitState(
+ this.hasBeenOpen ? "connecting_recovery" : "connecting_initial",
+ );
</file context>
| this.emitState( | |
| this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", | |
| ); | |
| if ( | |
| this.hasBeenOpen && | |
| this.recoveryStartedAt !== null && | |
| Date.now() - this.recoveryStartedAt > RECOVERY_GIVEUP_MS | |
| ) { | |
| this.emitState("disconnected"); | |
| } else { | |
| this.emitState( | |
| this.hasBeenOpen ? "connecting_recovery" : "connecting_initial", | |
| ); | |
| } |
407d738 to
f8e033c
Compare
…ll (SUPER-783)
"New workspace" automations (the default for scheduled agent runs) failed
silently: the workspace would never get created and the agent would never
start. The user-visible symptom was an opaque dispatch failure.
## Root cause
`dispatchAutomation` made two sequential relay round-trips for the
new-workspace path:
1. `workspaces.create` → returns workspace id
2. `agents.run` → spawns the agent against that id
Each round-trip independently races the tunnel state on the host side.
The first call would usually succeed (workspace gets created in cloud +
host), but the second call would frequently land in a moment where the
tunnel had just dropped — the relay returned 503 "Host not connected"
and the agent never spawned. The cloud wrote the run as `dispatch_failed`
even though a workspace had actually been created on the host.
## Fix
The host's `workspaces.create` handler already supports launching agents
inline via the `agents: []` field (see
`packages/host-service/.../workspaces.ts:1025` — `dispatchSugarAgents`).
The Superset SDK (`packages/sdk/src/resources/workspaces.ts:41`) already
uses this. The dispatch path was the only caller that split create and
launch into two trips.
Pass `agents: [{ agent, prompt }]` in the same `workspaces.create` call.
One relay round-trip instead of two, no second tunnel race, atomic on the
host side. The pinned-workspace path (`automation.v2WorkspaceId !== null`)
is unchanged — still a single `agents.run` call.
f8e033c to
a47df6a
Compare
|
Moving the WHY out of code comments per #4936 (comment). Why the fix is shaped this way: The "New workspace" path used to make two sequential relay round-trips:
Each round-trip independently races the host's tunnel state. The first call would usually succeed (workspace gets created in cloud + host), but the second one would frequently land in a moment where the tunnel had just reconnected and the relay's in-memory routing didn't yet have the host — relay returned The host's This PR passes |
🚧 Blocked on staging verification before code reviewThe change here is small (41 lines, one file) but I cannot self-verify the end-to-end workspace-create flow in local dev. Mirroring the comment I left on SUPER-783 — needs someone with staging credentials to deploy + test before this is reviewable. I'm on a work trial without staging access. Why local verification is blockedI fired 10+ dispatch attempts against the local stack. Every one died before reaching host-service, for two independent reasons that are not caused by this PR: 1. Tunnel flap in this worktree's dev env. host-service ↔ relay WebSocket reconnects every ~1s with no close-code logged: Every dispatch hit the relay during a brief disconnected window → 2. Empty host-service projects table. The host's local SQLite has zero rows in Why I'm still confident the fix is correctThe pattern this PR adopts is already running in production via the Superset SDK. Every customer calling
What needs to happen on staging
Static checks that already pass:
Happy to pair with whoever does the staging run. |
Summary
Closes SUPER-783. "New workspace" automations (the default for scheduled agent runs) failed silently — the workspace would never get created and the agent would never start. User saw an opaque dispatch failure with no indication why.
Root cause
dispatchAutomationmade two sequential relay round-trips for the new-workspace path:workspaces.create→ returns the new workspace idagents.run→ spawns the agent against that idEach round-trip independently races the host's tunnel state. The first call would usually succeed (workspace gets created in cloud + host), but the second call would frequently land in a moment where the tunnel had just reconnected and the relay's in-memory routing didn't yet have the host — relay returned
503 "Host not connected"and the agent never spawned. The cloud wrote the run asdispatch_failedeven though a workspace had actually been created on the host.Fix
The host's
workspaces.createhandler already supports launching agents inline via theagents: []field — seepackages/host-service/src/trpc/router/workspaces/workspaces.ts:1025(dispatchSugarAgents). The Superset SDK (packages/sdk/src/resources/workspaces.ts:41) already uses this. The dispatch path was the only caller that split create and launch into two trips.This PR passes
agents: [{ agent, prompt }]in the sameworkspaces.createcall. One relay round-trip instead of two, no second tunnel race, atomic on the host side.The pinned-workspace path (
automation.v2WorkspaceId !== null) is unchanged — still a singleagents.runcall.Diff
Test plan
automation_runsrow withstatus='dispatched',v2_workspace_idpopulated,session_kindset)dispatch_failedautomation.v2WorkspaceId !== nullpath unchanged)dispatch_failedwith the host's error message instead of an opaque relay error