feat(api): reconcile host online status from relay directory#4476
Conversation
Adds a QStash-triggered route at /api/hosts/jobs/sync-presence that reads the relay's Upstash tunnel-TTL sorted set and diffs against v2_hosts.is_online, flipping any rows where DB state has drifted from the actual connected set. Existing behavior — relay calls host.setOnline on tunnel open/close — remains the primary path; this is the safety net for relay crashes and host.setOnline retry-exhaustion, both of which left rows stuck at is_online=true with no self-healing mechanism. Reconciler is scheduled every minute via QStash. Reads relay:tunnel-ttl directly from Upstash rather than hopping through a new relay HTTP endpoint — the API already has the same KV credentials. Also: - cli `hosts list` relabels disconnected rows as "local" when the row matches the local machine via getHostId() — gives users running the daemon on the same box a clearer "running but not relay-connected" signal versus a generic "no". - apps/relay/fly.staging.toml: config for a parallel staging Fly app (superset-relay-staging), targeted at individual users via the existing relay-url-override PostHog flag for E2E testing.
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
📝 WalkthroughWalkthroughAdds a POST API endpoint that reconciles v2_hosts.is_online from connected tunnel entries in Upstash Redis, a Fly staging config for the relay, and a CLI change that marks the local host as "local" in hosts list output. ChangesHost Presence Sync Infrastructure
CLI Local Host Display
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts (1)
36-37: ⚡ Quick winUse
request.urlfor QStash signature verification to prevent config drift.The JWT 'sub' claim must match the actual endpoint URL. Using
request.urlensures verification against the real incoming request rather than a potentially mismatched environment variable, especially across different deployment environments or behind proxies.Proposed fix
- url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, + url: request.url,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 36 - 37, The QStash signature verification is currently using env.NEXT_PUBLIC_API_URL to build the expected request URL which can drift from the actual incoming URL; update the verification logic in the route handler that checks the JWT 'sub' claim so it uses request.url instead of env.NEXT_PUBLIC_API_URL (i.e., replace uses of env.NEXT_PUBLIC_API_URL when constructing the expected 'sub' URL in the QStash signature verification code inside the sync-presence route) so the JWT 'sub' matches the real incoming request URL every time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 36-37: The QStash signature verification is currently using
env.NEXT_PUBLIC_API_URL to build the expected request URL which can drift from
the actual incoming URL; update the verification logic in the route handler that
checks the JWT 'sub' claim so it uses request.url instead of
env.NEXT_PUBLIC_API_URL (i.e., replace uses of env.NEXT_PUBLIC_API_URL when
constructing the expected 'sub' URL in the QStash signature verification code
inside the sync-presence route) so the JWT 'sub' matches the real incoming
request URL every time.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 981fe690-ecdf-4f49-b020-0141a9f37bec
📒 Files selected for processing (3)
apps/api/src/app/api/hosts/jobs/sync-presence/route.tsapps/relay/fly.staging.tomlpackages/cli/src/commands/hosts/list/command.ts
Greptile SummaryThis PR adds a QStash-scheduled presence reconciler (
Confidence Score: 3/5The reconciler logic and key format are correct, but a DB failure during the update throws an unhandled exception with no log output, and a missing or wrong Redis key would silently flip every host offline. The core logic is sound and the key format aligns with buildHostRoutingKey. However, the db.execute block lacks error handling, and redis.zrange on a missing key returns [] without error, so a misconfigured KV URL would cause a bulk offline flip of every host with no observability signal. apps/api/src/app/api/hosts/jobs/sync-presence/route.ts needs a try-catch around db.execute and a guard for the empty-connected-array case before the route is safe to run every minute in production.
|
| Filename | Overview |
|---|---|
| apps/api/src/app/api/hosts/jobs/sync-presence/route.ts | New QStash-triggered reconciler that diffs Upstash sorted-set membership against v2_hosts.is_online; Redis read is guarded but the DB update has no error handling, and an absent key silently bulk-flips all hosts offline. |
| apps/relay/fly.staging.toml | Staging Fly app config for superset-relay-staging; mirrors the production relay shape with identical health-check and concurrency settings. |
| packages/cli/src/commands/hosts/list/command.ts | Adds a 'local' label for the current machine's host entry when is_online is false; compares row.id (machineId HMAC) against getHostId() which is the correct identifier. |
Sequence Diagram
sequenceDiagram
participant QS as QStash Scheduler
participant API as POST /api/hosts/jobs/sync-presence
participant UP as Upstash Redis
participant DB as PostgreSQL (v2_hosts)
QS->>API: POST (with upstash-signature)
API->>API: Verify QStash signature
API->>UP: ZRANGEBYSCORE relay:tunnel-ttl Date.now() +inf
UP-->>API: connected[] (organizationId:machineId members)
API->>DB: "UPDATE v2_hosts SET is_online = (id IN connected) WHERE is_online IS DISTINCT FROM expected"
DB-->>API: flipped rows
API-->>QS: "200 { connected, flippedOn, flippedOff }"
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:57-76
**Unhandled DB failure crashes the reconciler**
The Redis call above has a `try/catch` that returns a `502`, but the `db.execute` block has none. A transient DB connection error (or a Postgres type error if `connected` somehow produces a malformed `ANY` cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no `flippedOn`/`flippedOff` counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.
### Issue 2 of 2
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:44-55
**Empty `connected` set silently marks every host offline**
`redis.zrange` on a missing key returns `[]` without throwing — so if `relay:tunnel-ttl` is absent (wrong `KV_REST_API_URL` secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the `UPDATE` query flips every host to `is_online = false`. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when `connected.length === 0`, or adding a sanity threshold before committing the write.
Reviews (1): Last reviewed commit: "feat(api): reconcile host online status ..." | Re-trigger Greptile
There was a problem hiding this comment.
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts">
<violation number="1" location="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:38">
P2: Do not silently swallow signature verification errors; log the caught error before returning `false` so real verification failures are observable.
(Based on your team's feedback about surfacing async failures instead of empty catches.) [FEEDBACK_USED]</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| const result = await db.execute<{ | ||
| organization_id: string; | ||
| machine_id: string; | ||
| is_online: boolean; | ||
| }>(sql` | ||
| WITH desired AS ( | ||
| SELECT | ||
| organization_id, | ||
| machine_id, | ||
| (organization_id::text || ':' || machine_id) = ANY(${connected}::text[]) AS expected | ||
| FROM v2_hosts | ||
| ) | ||
| UPDATE v2_hosts h | ||
| SET is_online = d.expected | ||
| FROM desired d | ||
| WHERE h.organization_id = d.organization_id | ||
| AND h.machine_id = d.machine_id | ||
| AND h.is_online IS DISTINCT FROM d.expected | ||
| RETURNING h.organization_id, h.machine_id, h.is_online | ||
| `); |
There was a problem hiding this comment.
Unhandled DB failure crashes the reconciler
The Redis call above has a try/catch that returns a 502, but the db.execute block has none. A transient DB connection error (or a Postgres type error if connected somehow produces a malformed ANY cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no flippedOn/flippedOff counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/api/src/app/api/hosts/jobs/sync-presence/route.ts
Line: 57-76
Comment:
**Unhandled DB failure crashes the reconciler**
The Redis call above has a `try/catch` that returns a `502`, but the `db.execute` block has none. A transient DB connection error (or a Postgres type error if `connected` somehow produces a malformed `ANY` cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no `flippedOn`/`flippedOff` counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.
How can I resolve this? If you propose a fix, please make it concise.| let connected: string[]; | ||
| try { | ||
| connected = await redis.zrange<string[]>( | ||
| RELAY_TTL_KEY, | ||
| Date.now(), | ||
| "+inf", | ||
| { byScore: true }, | ||
| ); | ||
| } catch (error) { | ||
| console.error("[sync-presence] redis read failed:", error); | ||
| return Response.json({ error: "Directory read failed" }, { status: 502 }); | ||
| } |
There was a problem hiding this comment.
Empty
connected set silently marks every host offline
redis.zrange on a missing key returns [] without throwing — so if relay:tunnel-ttl is absent (wrong KV_REST_API_URL secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the UPDATE query flips every host to is_online = false. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when connected.length === 0, or adding a sanity threshold before committing the write.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/api/src/app/api/hosts/jobs/sync-presence/route.ts
Line: 44-55
Comment:
**Empty `connected` set silently marks every host offline**
`redis.zrange` on a missing key returns `[]` without throwing — so if `relay:tunnel-ttl` is absent (wrong `KV_REST_API_URL` secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the `UPDATE` query flips every host to `is_online = false`. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when `connected.length === 0`, or adding a sanity threshold before committing the write.
How can I resolve this? If you propose a fix, please make it concise.
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
- Wrap the reconcile UPDATE in try/catch and log on failure. Without this, a transient DB error throws into Next.js as an opaque 500, QStash silently retries into the same failure, no flippedOn/flippedOff to anchor diagnosis. - Refuse to write when the Upstash directory returns an empty connected set. An absent or misrouted relay:tunnel-ttl key would otherwise silently flip every host to is_online=false. The relay's event-driven setOnline path still handles genuine disconnects, so skipping here is safe. - Log on the QStash signature verify catch instead of swallowing the error.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts (2)
86-100: ⚖️ Poor tradeoffReconciliation does a full
v2_hostsscan every minute.The CTE
SELECT … FROM v2_hostsmaterialises every row to computeexpected, regardless of how many are actually misaligned. For a small fleet this is fine, but at scale (100k+ hosts × 1 run/min) it becomes a recurring full-table read on a hot OLTP table. Two cheap improvements you might consider before this becomes a problem:
- Restrict the CTE to candidate rows only:
WHERE is_online = true OR (organization_id::text || ':' || machine_id) = ANY(...)— that limits the scan to currently-online hosts plus the connected set, which is the actual diff surface.- Composite index on
v2_hosts (organization_id, machine_id)if one doesn't already exist (the join in the UPDATE depends on it being indexed for the planner to avoid a second seq scan).Worth flagging now since the QStash schedule is fixed at 1/min; if you expect host count to grow an order of magnitude, this is the path that gets expensive first.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 86 - 100, The CTE named desired scans all rows in v2_hosts each run; restrict it to the actual candidate set by adding a WHERE clause to the CTE such as WHERE is_online = true OR (organization_id::text || ':' || machine_id) = ANY(${connected}::text[]) so only currently-online hosts and the connected set are materialized before the UPDATE, and ensure there is a composite index on (organization_id, machine_id) (create an index if none exists) so the UPDATE ... FROM desired join on organization_id and machine_id uses the index rather than triggering a second seq scan.
32-37: 💤 Low valueQStash signature verification URL vulnerable to
NEXT_PUBLIC_API_URLformatting with trailing slashes.QStash signs the exact URL it POSTs to. If
NEXT_PUBLIC_API_URLis configured with a trailing slash (e.g.,https://api.example.com/), the constructed URL becomes malformed (https://…//api/hosts/jobs/sync-presence) and signature verification fails silently.The env validation (
z.string().url()) accepts but does not normalize trailing slashes. A normalization pattern already exists elsewhere in the codebase (apps/api/src/trpc/context.tsline 9). Consider normalizingNEXT_PUBLIC_API_URLat the env level or at the verification call site:♻️ Defensive normalisation
+const API_BASE = env.NEXT_PUBLIC_API_URL.replace(/\/+$/, ""); ... - url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, + url: `${API_BASE}/api/hosts/jobs/sync-presence`,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 32 - 37, The QStash signature verification uses receiver.verify with url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, which breaks if NEXT_PUBLIC_API_URL has a trailing slash; normalize the base URL before verification (either at env parsing or immediately before calling receiver.verify) by removing any trailing slash (e.g., use a trimEnd('/') equivalent) and then build the full path, ensuring receiver.verify receives the exact URL QStash signed; update references in route.ts around receiver.verify and/or the env parsing to use the normalized NEXT_PUBLIC_API_URL.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 49-54: The handler reads bare hostId values into connected from
RELAY_TTL_KEY but later compares against SQL rows built as
organization_id:machine_id, so expected will always be false; fix by
reverse-mapping each hostId using the tunnel-owner HSET (the "tunnel-owner" key)
to obtain the corresponding machine_id/organization_id pair and reconstruct the
compound key before comparison (use redis.hget/hmget on the tunnel-owner mapping
for the hostId(s) retrieved into connected and build
`${organization_id}:${machine_id}` to compare with the SQL identifier), or
alternatively change relay's zadd in apps/relay/src/directory.ts to store
`${organizationId}:${hostId}` as the member so the formats match—implement one
of these two fixes and keep RELAY_TTL_KEY/connected usage consistent with the
chosen mapping approach.
- Around line 47-58: Add explicit per-driver timeouts: recreate or ensure the
Redis client used by this route is constructed with a signal (e.g. new Redis({
..., signal: AbortSignal.timeout(5000) })) so all redis.zrange calls have a 5s
deadline, and replace any existing redis client usage in this file (including
the other call sites around lines 80-105) to use that client; for DB calls
(db.execute in this file) wrap the execute call with an AbortSignal timeout or
add a SQL statement timeout so the query has a bounded deadline (e.g. call
db.execute within AbortSignal.timeout(...) or include a query-level
statement_timeout), and ensure the catch blocks still handle and log AbortErrors
consistently.
---
Nitpick comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 86-100: The CTE named desired scans all rows in v2_hosts each run;
restrict it to the actual candidate set by adding a WHERE clause to the CTE such
as WHERE is_online = true OR (organization_id::text || ':' || machine_id) =
ANY(${connected}::text[]) so only currently-online hosts and the connected set
are materialized before the UPDATE, and ensure there is a composite index on
(organization_id, machine_id) (create an index if none exists) so the UPDATE ...
FROM desired join on organization_id and machine_id uses the index rather than
triggering a second seq scan.
- Around line 32-37: The QStash signature verification uses receiver.verify with
url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, which breaks if
NEXT_PUBLIC_API_URL has a trailing slash; normalize the base URL before
verification (either at env parsing or immediately before calling
receiver.verify) by removing any trailing slash (e.g., use a trimEnd('/')
equivalent) and then build the full path, ensuring receiver.verify receives the
exact URL QStash signed; update references in route.ts around receiver.verify
and/or the env parsing to use the normalized NEXT_PUBLIC_API_URL.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 408bb6a9-dcf7-4fe2-be5f-38c453129cee
📒 Files selected for processing (1)
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts
| let connected: string[]; | ||
| try { | ||
| connected = await redis.zrange<string[]>( | ||
| RELAY_TTL_KEY, | ||
| Date.now(), | ||
| "+inf", | ||
| { byScore: true }, | ||
| ); | ||
| } catch (error) { | ||
| console.error("[sync-presence] redis read failed:", error); | ||
| return Response.json({ error: "Directory read failed" }, { status: 502 }); | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
@upstash/redis 1.37 zrange AbortSignal option
💡 Result:
@upstash/redis (upstash/redis-js) does not support an AbortSignal option on the ZRANGE command options (e.g., there is no zrange(..., { AbortSignal:... }) or zrange(..., { signal:... }) supported in the command’s option type). What it does support is passing an AbortSignal to the Redis client itself, via the client configuration. The Node.js platform config includes a signal option (“The signal will allow aborting requests on the fly”) [1]. Additionally, the underlying HTTP request layer for commands accepts signal?: AbortSignal [2], and the command execution passes signal: this.signal into the request [3]. How to use (example) 1) Create the client with a signal const controller = new AbortController; const redis = new Redis({ url: process.env.UPSTASH_REDIS_REST_URL!, token: process.env.UPSTASH_REST_TOKEN!, signal: controller.signal, }); // cancel in-flight request controller.abort; This pattern is shown directly in Upstash’s client configuration docs [4]. 2) Or set a timeout signal (example uses AbortSignal.timeout) Upstash’s advanced docs show configuring the SDK with signal: () => AbortSignal.timeout(1000) [5]. So for your exact query (“@upstash/redis 1.37 zrange AbortSignal option”): use client-level signal, not a per-command zrange option; zrange options in code include only ZRANGE-specific flags like withScores, rev, byScore, byLex, and pagination (offset/count) [6], and Upstash’s ZRANGE docs list only those ZRANGE arguments (with no AbortSignal mention) [7].
Citations:
- 1: https://github.com/upstash/redis-js/blob/fbead4fb/platforms/nodejs.ts
- 2: https://github.com/upstash/redis-js/blob/fbead4fb/pkg/http.ts
- 3: https://github.com/upstash/redis-js/blob/fbead4fb/pkg/commands/command.ts
- 4: https://mintlify.com/upstash/redis-js/concepts/client-configuration
- 5: https://upstash.com/docs/redis/sdks/ts/advanced
- 6: https://github.com/upstash/redis-js/blob/fbead4fb/pkg/commands/zrange.ts
- 7: https://upstash.com/docs/redis/sdks/ts/commands/zset/zrange
🏁 Script executed:
# Check package.json for `@upstash/redis` and drizzle versions
find . -name "package.json" -type f | head -5 | xargs grep -l "@upstash/redis\|drizzle"Repository: superset-sh/superset
Length of output: 177
🏁 Script executed:
# Read the complete route file to understand the implementation
cat -n apps/api/src/app/api/hosts/jobs/sync-presence/route.tsRepository: superset-sh/superset
Length of output: 3984
🏁 Script executed:
# Check for Redis client initialization in the codebase
fd -e ts -e tsx | xargs grep -l "new Redis\|Redis({" | head -10Repository: superset-sh/superset
Length of output: 279
🏁 Script executed:
# Check `@upstash/redis` and drizzle versions in package.json
cat apps/api/package.json | grep -A 2 -B 2 "@upstash/redis\|drizzle-orm"Repository: superset-sh/superset
Length of output: 315
🏁 Script executed:
# Check how Redis client is used elsewhere in the codebase
rg "redis\.(zrange|execute)" --type ts --type tsx -B 2 -A 2Repository: superset-sh/superset
Length of output: 92
🏁 Script executed:
# Look for any existing timeout patterns in the codebase
rg "AbortSignal|statement_timeout|timeout" --type ts --type tsx apps/api | head -20Repository: superset-sh/superset
Length of output: 92
🏁 Script executed:
# Search for redis usage patterns more broadly
rg "redis\." apps/api --type ts -B 1 -A 1 | head -30Repository: superset-sh/superset
Length of output: 298
🏁 Script executed:
# Check Drizzle documentation patterns and db.execute in codebase
rg "db\.execute" apps/api --type ts -B 2 -A 2Repository: superset-sh/superset
Length of output: 438
🏁 Script executed:
# Look for any timeout patterns in API routes
rg "timeout|AbortController|AbortSignal" apps/api --type tsRepository: superset-sh/superset
Length of output: 46
🏁 Script executed:
# Check Drizzle db client configuration in the codebase
find . -name "*db*" -type f -path "*/packages/db/*" | head -10Repository: superset-sh/superset
Length of output: 119
🏁 Script executed:
# Look for db client initialization
rg "createPool|postgres\(" --type ts -B 2 -A 5 | head -40Repository: superset-sh/superset
Length of output: 46
🏁 Script executed:
# Find the db client file
find . -path "*/db/client*" -type fRepository: superset-sh/superset
Length of output: 46
🏁 Script executed:
# Find db package structure
ls -la packages/db/Repository: superset-sh/superset
Length of output: 485
🏁 Script executed:
# Check for database configuration
rg "new Database|new Pool|postgres\(" packages/ --type ts -B 1 -A 3 | head -50Repository: superset-sh/superset
Length of output: 1693
🏁 Script executed:
# Read the db client configuration
cat packages/db/src/client.tsRepository: superset-sh/superset
Length of output: 619
🏁 Script executed:
# Check drizzle-orm types for execute method
rg "execute.*function\|execute.*Promise" packages/ --type ts -B 2 -A 5 | head -50Repository: superset-sh/superset
Length of output: 46
🏁 Script executed:
# Check how db is actually imported in the sync-presence route
head -10 apps/api/src/app/api/hosts/jobs/sync-presence/route.tsRepository: superset-sh/superset
Length of output: 329
🏁 Script executed:
# Check if there are any other examples of timeout handling in the codebase
rg "controller.*abort\|AbortController" . --type ts -B 2 -A 2Repository: superset-sh/superset
Length of output: 46
🏁 Script executed:
# Verify what db import path resolves to and check for timeout patterns
rg "@superset/db" apps/ --type ts -B 1 -A 1 | head -20Repository: superset-sh/superset
Length of output: 1378
Add explicit timeouts to the Redis and DB calls with correct per-driver approaches.
QStash will retry on non-2xx, but a hung request can sit on a Node handler indefinitely and tie up retries behind it. Neither redis.zrange nor db.execute here has a deadline. However, the implementation approach depends on the drivers:
- Redis (
@upstash/redis1.37.0): Does not supportAbortSignalin zrange options. Instead, passsignalto the Redis client constructor:new Redis({ ..., signal: AbortSignal.timeout(5000) }). This applies to all commands. - Database (Neon serverless + Drizzle 0.45.2): The Neon serverless driver is HTTP-based. Use SQL-level statement timeouts within the query or wrap with
AbortSignal.timeout()on thedb.execute()call itself.
A bounded timeout keeps the reconciler self-healing under upstream slowness.
Also applies to: 80-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 47 - 58,
Add explicit per-driver timeouts: recreate or ensure the Redis client used by
this route is constructed with a signal (e.g. new Redis({ ..., signal:
AbortSignal.timeout(5000) })) so all redis.zrange calls have a 5s deadline, and
replace any existing redis client usage in this file (including the other call
sites around lines 80-105) to use that client; for DB calls (db.execute in this
file) wrap the execute call with an AbortSignal timeout or add a SQL statement
timeout so the query has a bounded deadline (e.g. call db.execute within
AbortSignal.timeout(...) or include a query-level statement_timeout), and ensure
the catch blocks still handle and log AbortErrors consistently.
| connected = await redis.zrange<string[]>( | ||
| RELAY_TTL_KEY, | ||
| Date.now(), | ||
| "+inf", | ||
| { byScore: true }, | ||
| ); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the relay's directory contract matches this consumer.
fd -t f 'directory\.(ts|js)$' apps/relay
rg -nP -C3 'tunnel-ttl|relay:tunnel' apps/relay
# Look at how scores are computed and how members are formatted.
rg -nP -C5 '\b(zadd|ZADD|expireAt|score)\b' apps/relay
rg -nP -C3 'machine[_-]?id.*organization|organization.*machine' apps/relay/srcRepository: superset-sh/superset
Length of output: 4490
Critical mismatch: relay writes hostId members, but SQL expects organization_id:machine_id format.
The relay stores ZSET members as bare hostId values (line 111 in apps/relay/src/directory.ts):
redis.zadd(TTL_KEY, { score: now + TTL_GRACE_MS, member: hostId })
This handler retrieves that array into connected and then attempts to match it against a SQL WHERE clause that builds identifiers as organization_id || ':' || machine_id. Since the formats do not match, every row will have expected = false and all connected hosts will be flipped offline silently.
Fix: Either
- Have the relay write members as
${organizationId}:${hostId}(requires changes toapps/relay/src/directory.tsregister and sweep), or - Parse the
hostIdfromconnected[]to look up the correspondingorganization_idin a separate index before comparing, or - Use the
tunnel-ownerHSET to reverse-maphostId→machineIdand reconstruct the compound key.
The score units (milliseconds) and key name (relay:tunnel-ttl) are correct.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 49 - 54,
The handler reads bare hostId values into connected from RELAY_TTL_KEY but later
compares against SQL rows built as organization_id:machine_id, so expected will
always be false; fix by reverse-mapping each hostId using the tunnel-owner HSET
(the "tunnel-owner" key) to obtain the corresponding machine_id/organization_id
pair and reconstruct the compound key before comparison (use redis.hget/hmget on
the tunnel-owner mapping for the hostId(s) retrieved into connected and build
`${organization_id}:${machine_id}` to compare with the SQL identifier), or
alternatively change relay's zadd in apps/relay/src/directory.ts to store
`${organizationId}:${hostId}` as the member so the formats match—implement one
of these two fixes and keep RELAY_TTL_KEY/connected usage consistent with the
chosen mapping approach.
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts">
<violation number="1" location="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:63">
P1: Skipping reconciliation when `connected` is empty breaks the legitimate “all hosts offline” case and can leave stale `is_online=true` rows unhealed.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| if (connected.length === 0) { | ||
| console.warn( | ||
| "[sync-presence] empty connected set; skipping reconcile to avoid mass-flip", | ||
| ); | ||
| return Response.json({ | ||
| connected: 0, | ||
| flippedOn: 0, | ||
| flippedOff: 0, | ||
| skipped: true, | ||
| }); | ||
| } |
There was a problem hiding this comment.
P1: Skipping reconciliation when connected is empty breaks the legitimate “all hosts offline” case and can leave stale is_online=true rows unhealed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/app/api/hosts/jobs/sync-presence/route.ts, line 63:
<comment>Skipping reconciliation when `connected` is empty breaks the legitimate “all hosts offline” case and can leave stale `is_online=true` rows unhealed.</comment>
<file context>
@@ -54,29 +57,55 @@ export async function POST(request: Request): Promise<Response> {
+ // Refuse to mass-flip when the directory comes back empty — most likely a
+ // misconfigured KV credential or a wiped key, not a real zero-host state.
+ // The relay's event-driven setOnline writes still cover genuine disconnects.
+ if (connected.length === 0) {
+ console.warn(
+ "[sync-presence] empty connected set; skipping reconcile to avoid mass-flip",
</file context>
| if (connected.length === 0) { | |
| console.warn( | |
| "[sync-presence] empty connected set; skipping reconcile to avoid mass-flip", | |
| ); | |
| return Response.json({ | |
| connected: 0, | |
| flippedOn: 0, | |
| flippedOff: 0, | |
| skipped: true, | |
| }); | |
| } | |
| // Allow empty directory snapshots so reconciliation can flip all hosts offline when none are connected. |
Changes since v0.2.15: - workspaces: `superset workspaces list` table now shows the workspace ID column. (#4463) - docs: VHS demo walkthrough recording added under `packages/cli/demo/`. (#4461) - auth: `superset auth login --api-key sk_live_…` stores an API key at `~/.superset/config.json` instead of running the OAuth flow; `whoami` and `logout` updated to recognize stored keys. (#4472) - hosts: `superset hosts list` shows `local` for the host machine you're invoking from, distinct from `online`/`no`. (#4476) - automations: `--agent` is now the host agent instance/preset id (e.g. `claude`, `codex`, `superset`) and is resolved live from the host on every run. The `--agent-config-file` flag is removed; create/update rename `agentConfig` -> `agent` end-to-end. (#4481) Push cli-v0.2.16 after this lands to fire the release pipeline.
Summary
Adds a safety net for stale
v2_hosts.is_onlinerows. Today the boolean is written event-driven by the relay callinghost.setOnlineon tunnel open/close (3-retry, then silently dropped). When the relay crashes orhost.setOnlineexhausts its retries, the DB staysis_online = trueindefinitely with no self-healing — and automation dispatch (packages/trpc/src/router/automation/dispatch.ts), the desktop popovers, port scans, PR target derivation, and host badges all gate behavior on this boolean. Agents then route work at unreachable hosts and get confused, which is the user-visible bug this addresses.Approach
The relay already maintains the ground-truth connected set in Upstash (
relay:tunnel-ttlsorted set, refreshed on every pong, evicted bysweepStaleafter a 90s grace window — seeapps/relay/src/directory.ts). The new reconciler is a QStash-triggered route atPOST /api/hosts/jobs/sync-presencethat:relay:tunnel-ttldirectly from the same Upstash the relay uses (API already has the credentials).v2_hosts.is_onlinein a single CTE-drivenUPDATE ... IS DISTINCT FROMusingANY(\$1::text[])for membership.flippedOn/flippedOfffor observability.A QStash schedule fires this every minute. Combined with the relay's existing 90s TTL + 30s sweep, a stuck row corrects within ~3 minutes worst case.
The existing event-driven
host.setOnlinewrite stays in place — it handles the common-case sub-second transition. The reconciler only catches the failure modes (relay crash, API write failures).Why direct-Upstash instead of a new relay endpoint
Earlier in the design we considered adding a
GET /internal/presenceto the relay with a shared-secret auth. Dropped because:KV_REST_API_URL/KV_REST_API_TOKENenv vars (used byapps/api/src/app/api/chat/tools/web-search/route.tsfor rate limiting).relay:tunnel-ttl) is small and documented inline.Other changes in this PR
packages/cli/src/commands/hosts/list/command.ts— relabels offline rows aslocalwhenrow.idmatchesgetHostId(). Gives users running the daemon on the same machine a clearer "running but not relay-connected" signal versus a genericno. Other clients still seenofor that host, which is the correct answer for them — they genuinely can't reach it.apps/relay/fly.staging.toml— config for a parallel staging Fly app (superset-relay-staging). Already deployed, targeted at individual users via the existingrelay-url-overridePostHog flag (CSP allowshttps://relay-backup.superset.shafter feat(desktop): allow relay-backup.superset.sh in renderer CSP #4473). Documents the staging setup so future test infra is reproducible.Test plan
superset-relay-staging.fly.dev), sandbox host registered there via the PostHogrelay-url-overrideflag.connectedcount.false(flippedOff> 0 in the response).v2_hostsrows for known-offline hosts to confirm they converge.Out of scope
v2_hosts.is_onlinestays a boolean; the reconciler keeps the value honest.dispatch.ts, popovers, port scans, badges) keep readingis_onlineunchanged.Summary by cubic
Adds a scheduled reconciler to sync host online status with the relay’s Upstash directory, fixing stuck-online rows and adding guardrails to avoid mass flips. This self-heals drift so routing, automation, and UI stop targeting unreachable machines.
Bug Fixes
POST /api/hosts/jobs/sync-presencereads Upstashrelay:tunnel-ttland diffs intov2_hosts.is_online; returnsconnected,flippedOn,flippedOff.UPDATEin try/catch and log; log on signature-verify failures.host.setOnlineas primary; this job repairs relay/API failure cases.New Features
hosts list: offline rows showlocalwhenrow.idmatchesgetHostId().apps/relay/fly.staging.tomlfor a staging relay app (superset-relay-staging) used via therelay-url-overrideflag.Written for commit c4ad013. Summary will update on new commits.
Summary by CodeRabbit
New Features
Chores