feat(api): reconcile host online status from relay directory by saddlepaddle · Pull Request #4476 · superset-sh/superset

saddlepaddle · 2026-05-13T00:18:11Z

Summary

Adds a safety net for stale v2_hosts.is_online rows. Today the boolean is written event-driven by the relay calling host.setOnline on tunnel open/close (3-retry, then silently dropped). When the relay crashes or host.setOnline exhausts its retries, the DB stays is_online = true indefinitely with no self-healing — and automation dispatch (packages/trpc/src/router/automation/dispatch.ts), the desktop popovers, port scans, PR target derivation, and host badges all gate behavior on this boolean. Agents then route work at unreachable hosts and get confused, which is the user-visible bug this addresses.

Approach

The relay already maintains the ground-truth connected set in Upstash (relay:tunnel-ttl sorted set, refreshed on every pong, evicted by sweepStale after a 90s grace window — see apps/relay/src/directory.ts). The new reconciler is a QStash-triggered route at POST /api/hosts/jobs/sync-presence that:

Reads relay:tunnel-ttl directly from the same Upstash the relay uses (API already has the credentials).
Diffs against v2_hosts.is_online in a single CTE-driven UPDATE ... IS DISTINCT FROM using ANY(\$1::text[]) for membership.
Returns counts of flippedOn / flippedOff for observability.

A QStash schedule fires this every minute. Combined with the relay's existing 90s TTL + 30s sweep, a stuck row corrects within ~3 minutes worst case.

The existing event-driven host.setOnline write stays in place — it handles the common-case sub-second transition. The reconciler only catches the failure modes (relay crash, API write failures).

Why direct-Upstash instead of a new relay endpoint

Earlier in the design we considered adding a GET /internal/presence to the relay with a shared-secret auth. Dropped because:

The API already has matching KV_REST_API_URL / KV_REST_API_TOKEN env vars (used by apps/api/src/app/api/chat/tools/web-search/route.ts for rate limiting).
One fewer network hop, one fewer env var to manage across two services, no shared secret to provision.
The coupling cost (knowing the key shape relay:tunnel-ttl) is small and documented inline.

Other changes in this PR

packages/cli/src/commands/hosts/list/command.ts — relabels offline rows as local when row.id matches getHostId(). Gives users running the daemon on the same machine a clearer "running but not relay-connected" signal versus a generic no. Other clients still see no for that host, which is the correct answer for them — they genuinely can't reach it.
apps/relay/fly.staging.toml — config for a parallel staging Fly app (superset-relay-staging). Already deployed, targeted at individual users via the existing relay-url-override PostHog flag (CSP allows https://relay-backup.superset.sh after feat(desktop): allow relay-backup.superset.sh in renderer CSP #4473). Documents the staging setup so future test infra is reproducible.

Test plan

Staging relay deployed (superset-relay-staging.fly.dev), sandbox host registered there via the PostHog relay-url-override flag.
After merge, watch the first QStash run via the Upstash dashboard or Vercel logs — confirm 2xx response and reasonable connected count.
Force-kill the staging relay machine to manufacture stuck-online entries; confirm next reconciler run flips them to false (flippedOff > 0 in the response).
Inspect a sample of v2_hosts rows for known-offline hosts to confirm they converge.

Out of scope

No schema changes. v2_hosts.is_online stays a boolean; the reconciler keeps the value honest.
No Electric/desktop reactive changes — reconciler updates flow through the normal Electric WAL fan-out, which is already at scale-acceptable volume since drift corrections are rare in steady state.
All consumers (dispatch.ts, popovers, port scans, badges) keep reading is_online unchanged.

Summary by cubic

Adds a scheduled reconciler to sync host online status with the relay’s Upstash directory, fixing stuck-online rows and adding guardrails to avoid mass flips. This self-heals drift so routing, automation, and UI stop targeting unreachable machines.

Bug Fixes
- QStash-triggered POST /api/hosts/jobs/sync-presence reads Upstash relay:tunnel-ttl and diffs into v2_hosts.is_online; returns connected, flippedOn, flippedOff.
- Reliability: skip writes when the directory is empty to avoid mass-offline; wrap the reconcile UPDATE in try/catch and log; log on signature-verify failures.
- Runs every minute; with the relay’s 90s TTL + sweep, drift corrects in ~3 minutes.
- Keeps event-driven host.setOnline as primary; this job repairs relay/API failure cases.
New Features
- CLI hosts list: offline rows show local when row.id matches getHostId().
- Added apps/relay/fly.staging.toml for a staging relay app (superset-relay-staging) used via the relay-url-override flag.

^{Written for commit c4ad013. Summary will update on new commits.}

Summary by CodeRabbit

New Features
- Host presence sync endpoint to reconcile online/offline status with connected tunnels, including signature verification and safety checks to avoid mass-flips.
- CLI hosts list now marks the current machine as "local" instead of "no".
Chores
- Added staging deployment configuration for the relay service.

Adds a QStash-triggered route at /api/hosts/jobs/sync-presence that reads the relay's Upstash tunnel-TTL sorted set and diffs against v2_hosts.is_online, flipping any rows where DB state has drifted from the actual connected set. Existing behavior — relay calls host.setOnline on tunnel open/close — remains the primary path; this is the safety net for relay crashes and host.setOnline retry-exhaustion, both of which left rows stuck at is_online=true with no self-healing mechanism. Reconciler is scheduled every minute via QStash. Reads relay:tunnel-ttl directly from Upstash rather than hopping through a new relay HTTP endpoint — the API already has the same KV credentials. Also: - cli `hosts list` relabels disconnected rows as "local" when the row matches the local machine via getHostId() — gives users running the daemon on the same box a clearer "running but not relay-connected" signal versus a generic "no". - apps/relay/fly.staging.toml: config for a parallel staging Fly app (superset-relay-staging), targeted at individual users via the existing relay-url-override PostHog flag for E2E testing.

capy-ai · 2026-05-13T00:18:22Z

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

coderabbitai · 2026-05-13T00:18:24Z

📝 Walkthrough

Walkthrough

Adds a POST API endpoint that reconciles v2_hosts.is_online from connected tunnel entries in Upstash Redis, a Fly staging config for the relay, and a CLI change that marks the local host as "local" in hosts list output.

Changes

Host Presence Sync Infrastructure

Layer / File(s)	Summary
Sync presence API endpoint `apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`	POST handler enforces Qstash signature verification (non-dev), reads connected tunnel IDs from Upstash Redis with score-based range filtering, early-exits when the connected set is empty, updates `v2_hosts.is_online` via a CTE-based bulk UPDATE that only flips differing rows, and returns connected plus flipped-on/flipped-off counts.
Relay staging deployment configuration `apps/relay/fly.staging.toml`	Fly.io staging config sets app name and region, specifies Dockerfile build source, sets `RELAY_PORT`, enforces HTTPS with an HTTP health check at `/health`, limits connection concurrency, and provisions VM resources (4GB memory, performance CPU, 2 CPUs).

CLI Local Host Display

Layer / File(s)	Summary
Local host identification in list output `packages/cli/src/commands/hosts/list/command.ts`	Import `getHostId` and update the online column to return `"local"` when the host ID matches the current local host, otherwise show `"no"` for offline hosts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I twitch my whiskers at the logs,
Redis hums the tunnel song,
Local host glows on the list,
Staging wings prepare to throng—
A rabbit cheers: sync done, hop along!" 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(api): reconcile host online status from relay directory' accurately and concisely summarizes the main change—adding a reconciler route to sync host online status from the relay's Upstash directory.
Description check	✅ Passed	The description is comprehensive and well-structured, covering motivation (stale is_online), approach (QStash-triggered reconciler reading Upstash), design rationale, test plan, and scope clearly, with all template sections addressed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch host-device-relay-status

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

apps/api/src/app/api/hosts/jobs/sync-presence/route.ts (1)
36-37: ⚡ Quick win

Use request.url for QStash signature verification to prevent config drift.

The JWT 'sub' claim must match the actual endpoint URL. Using request.url ensures verification against the real incoming request rather than a potentially mismatched environment variable, especially across different deployment environments or behind proxies.
Proposed fix
-				url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`,
+				url: request.url,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 36 - 37,
The QStash signature verification is currently using env.NEXT_PUBLIC_API_URL to
build the expected request URL which can drift from the actual incoming URL;
update the verification logic in the route handler that checks the JWT 'sub'
claim so it uses request.url instead of env.NEXT_PUBLIC_API_URL (i.e., replace
uses of env.NEXT_PUBLIC_API_URL when constructing the expected 'sub' URL in the
QStash signature verification code inside the sync-presence route) so the JWT
'sub' matches the real incoming request URL every time.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 36-37: The QStash signature verification is currently using
env.NEXT_PUBLIC_API_URL to build the expected request URL which can drift from
the actual incoming URL; update the verification logic in the route handler that
checks the JWT 'sub' claim so it uses request.url instead of
env.NEXT_PUBLIC_API_URL (i.e., replace uses of env.NEXT_PUBLIC_API_URL when
constructing the expected 'sub' URL in the QStash signature verification code
inside the sync-presence route) so the JWT 'sub' matches the real incoming
request URL every time.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 981fe690-ecdf-4f49-b020-0141a9f37bec

📥 Commits

Reviewing files that changed from the base of the PR and between d5f83b9 and 1dc295a.

📒 Files selected for processing (3)

apps/api/src/app/api/hosts/jobs/sync-presence/route.ts
apps/relay/fly.staging.toml
packages/cli/src/commands/hosts/list/command.ts

greptile-apps · 2026-05-13T00:24:18Z

Greptile Summary

This PR adds a QStash-scheduled presence reconciler (POST /api/hosts/jobs/sync-presence) that reads the relay's live-host sorted set from Upstash and bulk-updates v2_hosts.is_online for any rows that have drifted. It also adds a staging relay Fly config and a CLI improvement that labels the local machine's host as local rather than no when it is offline from the relay's perspective.

sync-presence/route.ts: New cron-driven route that diffs relay:tunnel-ttl membership (scores \u2265 Date.now()) against DB rows using a single CTE + IS DISTINCT FROM update; the key format (organizationId:machineId) correctly matches buildHostRoutingKey from @superset/shared/host-routing, and QStash signature verification is in place for production.
fly.staging.toml: Fly app config for a parallel staging relay instance, mirroring production concurrency/health-check settings.
packages/cli/src/commands/hosts/list/command.ts: Compares row.id (the machineId HMAC) against getHostId() to surface a local status \u2014 the identifier types align correctly.

Confidence Score: 3/5

The reconciler logic and key format are correct, but a DB failure during the update throws an unhandled exception with no log output, and a missing or wrong Redis key would silently flip every host offline.

The core logic is sound and the key format aligns with buildHostRoutingKey. However, the db.execute block lacks error handling, and redis.zrange on a missing key returns [] without error, so a misconfigured KV URL would cause a bulk offline flip of every host with no observability signal.

apps/api/src/app/api/hosts/jobs/sync-presence/route.ts needs a try-catch around db.execute and a guard for the empty-connected-array case before the route is safe to run every minute in production.

Important Files Changed

Filename	Overview
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts	New QStash-triggered reconciler that diffs Upstash sorted-set membership against v2_hosts.is_online; Redis read is guarded but the DB update has no error handling, and an absent key silently bulk-flips all hosts offline.
apps/relay/fly.staging.toml	Staging Fly app config for superset-relay-staging; mirrors the production relay shape with identical health-check and concurrency settings.
packages/cli/src/commands/hosts/list/command.ts	Adds a 'local' label for the current machine's host entry when is_online is false; compares row.id (machineId HMAC) against getHostId() which is the correct identifier.

Sequence Diagram

sequenceDiagram
    participant QS as QStash Scheduler
    participant API as POST /api/hosts/jobs/sync-presence
    participant UP as Upstash Redis
    participant DB as PostgreSQL (v2_hosts)

    QS->>API: POST (with upstash-signature)
    API->>API: Verify QStash signature
    API->>UP: ZRANGEBYSCORE relay:tunnel-ttl Date.now() +inf
    UP-->>API: connected[] (organizationId:machineId members)
    API->>DB: "UPDATE v2_hosts SET is_online = (id IN connected) WHERE is_online IS DISTINCT FROM expected"
    DB-->>API: flipped rows
    API-->>QS: "200 { connected, flippedOn, flippedOff }"

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:57-76
**Unhandled DB failure crashes the reconciler**

The Redis call above has a `try/catch` that returns a `502`, but the `db.execute` block has none. A transient DB connection error (or a Postgres type error if `connected` somehow produces a malformed `ANY` cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no `flippedOn`/`flippedOff` counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.

### Issue 2 of 2
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:44-55
**Empty `connected` set silently marks every host offline**

`redis.zrange` on a missing key returns `[]` without throwing — so if `relay:tunnel-ttl` is absent (wrong `KV_REST_API_URL` secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the `UPDATE` query flips every host to `is_online = false`. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when `connected.length === 0`, or adding a sanity threshold before committing the write.

_{Reviews (1): Last reviewed commit: "feat(api): reconcile host online status ..." | Re-trigger Greptile}

cubic-dev-ai

1 issue found across 3 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts">

<violation number="1" location="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:38">
P2: Do not silently swallow signature verification errors; log the caught error before returning `false` so real verification failures are observable.

(Based on your team's feedback about surfacing async failures instead of empty catches.) [FEEDBACK_USED]</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

greptile-apps · 2026-05-13T00:24:22Z

+	const result = await db.execute<{
+		organization_id: string;
+		machine_id: string;
+		is_online: boolean;
+	}>(sql`
+		WITH desired AS (
+			SELECT
+				organization_id,
+				machine_id,
+				(organization_id::text || ':' || machine_id) = ANY(${connected}::text[]) AS expected
+			FROM v2_hosts
+		)
+		UPDATE v2_hosts h
+		SET is_online = d.expected
+		FROM desired d
+		WHERE h.organization_id = d.organization_id
+			AND h.machine_id = d.machine_id
+			AND h.is_online IS DISTINCT FROM d.expected
+		RETURNING h.organization_id, h.machine_id, h.is_online
+	`);


Unhandled DB failure crashes the reconciler

The Redis call above has a try/catch that returns a 502, but the db.execute block has none. A transient DB connection error (or a Postgres type error if connected somehow produces a malformed ANY cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no flippedOn/flippedOff counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/api/src/app/api/hosts/jobs/sync-presence/route.ts Line: 57-76 Comment: **Unhandled DB failure crashes the reconciler** The Redis call above has a `try/catch` that returns a `502`, but the `db.execute` block has none. A transient DB connection error (or a Postgres type error if `connected` somehow produces a malformed `ANY` cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no `flippedOn`/`flippedOff` counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-13T00:24:24Z

+	let connected: string[];
+	try {
+		connected = await redis.zrange<string[]>(
+			RELAY_TTL_KEY,
+			Date.now(),
+			"+inf",
+			{ byScore: true },
+		);
+	} catch (error) {
+		console.error("[sync-presence] redis read failed:", error);
+		return Response.json({ error: "Directory read failed" }, { status: 502 });
+	}


Empty connected set silently marks every host offline

redis.zrange on a missing key returns [] without throwing — so if relay:tunnel-ttl is absent (wrong KV_REST_API_URL secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the UPDATE query flips every host to is_online = false. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when connected.length === 0, or adding a sanity threshold before committing the write.

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/api/src/app/api/hosts/jobs/sync-presence/route.ts Line: 44-55 Comment: **Empty `connected` set silently marks every host offline** `redis.zrange` on a missing key returns `[]` without throwing — so if `relay:tunnel-ttl` is absent (wrong `KV_REST_API_URL` secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the `UPDATE` query flips every host to `is_online = false`. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when `connected.length === 0`, or adding a sanity threshold before committing the write. How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-05-13T00:25:03Z

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

✅ Neon database branch

Thank you for your contribution! 🎉

- Wrap the reconcile UPDATE in try/catch and log on failure. Without this, a transient DB error throws into Next.js as an opaque 500, QStash silently retries into the same failure, no flippedOn/flippedOff to anchor diagnosis. - Refuse to write when the Upstash directory returns an empty connected set. An absent or misrouted relay:tunnel-ttl key would otherwise silently flip every host to is_online=false. The relay's event-driven setOnline path still handles genuine disconnects, so skipping here is safe. - Log on the QStash signature verify catch instead of swallowing the error.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

apps/api/src/app/api/hosts/jobs/sync-presence/route.ts (2)
86-100: ⚖️ Poor tradeoff

Reconciliation does a full v2_hosts scan every minute.

The CTE SELECT … FROM v2_hosts materialises every row to compute expected, regardless of how many are actually misaligned. For a small fleet this is fine, but at scale (100k+ hosts × 1 run/min) it becomes a recurring full-table read on a hot OLTP table. Two cheap improvements you might consider before this becomes a problem:

Restrict the CTE to candidate rows only: WHERE is_online = true OR (organization_id::text || ':' || machine_id) = ANY(...) — that limits the scan to currently-online hosts plus the connected set, which is the actual diff surface.

Composite index on v2_hosts (organization_id, machine_id) if one doesn't already exist (the join in the UPDATE depends on it being indexed for the planner to avoid a second seq scan).

Worth flagging now since the QStash schedule is fixed at 1/min; if you expect host count to grow an order of magnitude, this is the path that gets expensive first.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 86 -
100, The CTE named desired scans all rows in v2_hosts each run; restrict it to
the actual candidate set by adding a WHERE clause to the CTE such as WHERE
is_online = true OR (organization_id::text || ':' || machine_id) =
ANY(${connected}::text[]) so only currently-online hosts and the connected set
are materialized before the UPDATE, and ensure there is a composite index on
(organization_id, machine_id) (create an index if none exists) so the UPDATE ...
FROM desired join on organization_id and machine_id uses the index rather than
triggering a second seq scan.
32-37: 💤 Low value

QStash signature verification URL vulnerable to NEXT_PUBLIC_API_URL formatting with trailing slashes.

QStash signs the exact URL it POSTs to. If NEXT_PUBLIC_API_URL is configured with a trailing slash (e.g., https://api.example.com/), the constructed URL becomes malformed (https://…//api/hosts/jobs/sync-presence) and signature verification fails silently.

The env validation (z.string().url()) accepts but does not normalize trailing slashes. A normalization pattern already exists elsewhere in the codebase (apps/api/src/trpc/context.ts line 9). Consider normalizing NEXT_PUBLIC_API_URL at the env level or at the verification call site:
♻️ Defensive normalisation
+const API_BASE = env.NEXT_PUBLIC_API_URL.replace(/\/+$/, "");
 ...
-      url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`,
+      url: `${API_BASE}/api/hosts/jobs/sync-presence`,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 32 - 37,
The QStash signature verification uses receiver.verify with url:
`${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, which breaks if
NEXT_PUBLIC_API_URL has a trailing slash; normalize the base URL before
verification (either at env parsing or immediately before calling
receiver.verify) by removing any trailing slash (e.g., use a trimEnd('/')
equivalent) and then build the full path, ensuring receiver.verify receives the
exact URL QStash signed; update references in route.ts around receiver.verify
and/or the env parsing to use the normalized NEXT_PUBLIC_API_URL.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 49-54: The handler reads bare hostId values into connected from
RELAY_TTL_KEY but later compares against SQL rows built as
organization_id:machine_id, so expected will always be false; fix by
reverse-mapping each hostId using the tunnel-owner HSET (the "tunnel-owner" key)
to obtain the corresponding machine_id/organization_id pair and reconstruct the
compound key before comparison (use redis.hget/hmget on the tunnel-owner mapping
for the hostId(s) retrieved into connected and build
`${organization_id}:${machine_id}` to compare with the SQL identifier), or
alternatively change relay's zadd in apps/relay/src/directory.ts to store
`${organizationId}:${hostId}` as the member so the formats match—implement one
of these two fixes and keep RELAY_TTL_KEY/connected usage consistent with the
chosen mapping approach.
- Around line 47-58: Add explicit per-driver timeouts: recreate or ensure the
Redis client used by this route is constructed with a signal (e.g. new Redis({
..., signal: AbortSignal.timeout(5000) })) so all redis.zrange calls have a 5s
deadline, and replace any existing redis client usage in this file (including
the other call sites around lines 80-105) to use that client; for DB calls
(db.execute in this file) wrap the execute call with an AbortSignal timeout or
add a SQL statement timeout so the query has a bounded deadline (e.g. call
db.execute within AbortSignal.timeout(...) or include a query-level
statement_timeout), and ensure the catch blocks still handle and log AbortErrors
consistently.

---

Nitpick comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 86-100: The CTE named desired scans all rows in v2_hosts each run;
restrict it to the actual candidate set by adding a WHERE clause to the CTE such
as WHERE is_online = true OR (organization_id::text || ':' || machine_id) =
ANY(${connected}::text[]) so only currently-online hosts and the connected set
are materialized before the UPDATE, and ensure there is a composite index on
(organization_id, machine_id) (create an index if none exists) so the UPDATE ...
FROM desired join on organization_id and machine_id uses the index rather than
triggering a second seq scan.
- Around line 32-37: The QStash signature verification uses receiver.verify with
url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, which breaks if
NEXT_PUBLIC_API_URL has a trailing slash; normalize the base URL before
verification (either at env parsing or immediately before calling
receiver.verify) by removing any trailing slash (e.g., use a trimEnd('/')
equivalent) and then build the full path, ensuring receiver.verify receives the
exact URL QStash signed; update references in route.ts around receiver.verify
and/or the env parsing to use the normalized NEXT_PUBLIC_API_URL.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 408bb6a9-dcf7-4fe2-be5f-38c453129cee

📥 Commits

Reviewing files that changed from the base of the PR and between 1dc295a and c4ad013.

📒 Files selected for processing (1)

apps/api/src/app/api/hosts/jobs/sync-presence/route.ts

coderabbitai · 2026-05-13T01:01:55Z

+	let connected: string[];
+	try {
+		connected = await redis.zrange<string[]>(
+			RELAY_TTL_KEY,
+			Date.now(),
+			"+inf",
+			{ byScore: true },
+		);
+	} catch (error) {
+		console.error("[sync-presence] redis read failed:", error);
+		return Response.json({ error: "Directory read failed" }, { status: 502 });
+	}


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

@upstash/redis 1.37 zrange AbortSignal option

💡 Result:

@upstash/redis (upstash/redis-js) does not support an AbortSignal option on the ZRANGE command options (e.g., there is no zrange(..., { AbortSignal:... }) or zrange(..., { signal:... }) supported in the command’s option type). What it does support is passing an AbortSignal to the Redis client itself, via the client configuration. The Node.js platform config includes a signal option (“The signal will allow aborting requests on the fly”) [1]. Additionally, the underlying HTTP request layer for commands accepts signal?: AbortSignal [2], and the command execution passes signal: this.signal into the request [3]. How to use (example) 1) Create the client with a signal const controller = new AbortController; const redis = new Redis({ url: process.env.UPSTASH_REDIS_REST_URL!, token: process.env.UPSTASH_REST_TOKEN!, signal: controller.signal, }); // cancel in-flight request controller.abort; This pattern is shown directly in Upstash’s client configuration docs [4]. 2) Or set a timeout signal (example uses AbortSignal.timeout) Upstash’s advanced docs show configuring the SDK with signal: () => AbortSignal.timeout(1000) [5]. So for your exact query (“@upstash/redis 1.37 zrange AbortSignal option”): use client-level signal, not a per-command zrange option; zrange options in code include only ZRANGE-specific flags like withScores, rev, byScore, byLex, and pagination (offset/count) [6], and Upstash’s ZRANGE docs list only those ZRANGE arguments (with no AbortSignal mention) [7].

Citations:

1: https://github.com/upstash/redis-js/blob/fbead4fb/platforms/nodejs.ts

2: https://github.com/upstash/redis-js/blob/fbead4fb/pkg/http.ts

3: https://github.com/upstash/redis-js/blob/fbead4fb/pkg/commands/command.ts

4: https://mintlify.com/upstash/redis-js/concepts/client-configuration

5: https://upstash.com/docs/redis/sdks/ts/advanced

6: https://github.com/upstash/redis-js/blob/fbead4fb/pkg/commands/zrange.ts

7: https://upstash.com/docs/redis/sdks/ts/commands/zset/zrange

🏁 Script executed:

# Check package.json for `@upstash/redis` and drizzle versions find . -name "package.json" -type f | head -5 | xargs grep -l "@upstash/redis\|drizzle"

Repository: superset-sh/superset

Length of output: 177

🏁 Script executed:

# Read the complete route file to understand the implementation cat -n apps/api/src/app/api/hosts/jobs/sync-presence/route.ts

Repository: superset-sh/superset

Length of output: 3984

🏁 Script executed:

# Check for Redis client initialization in the codebase fd -e ts -e tsx | xargs grep -l "new Redis\|Redis({" | head -10

Repository: superset-sh/superset

Length of output: 279

🏁 Script executed:

# Check `@upstash/redis` and drizzle versions in package.json cat apps/api/package.json | grep -A 2 -B 2 "@upstash/redis\|drizzle-orm"

Repository: superset-sh/superset

Length of output: 315

🏁 Script executed:

# Check how Redis client is used elsewhere in the codebase rg "redis\.(zrange|execute)" --type ts --type tsx -B 2 -A 2

Repository: superset-sh/superset

Length of output: 92

🏁 Script executed:

# Look for any existing timeout patterns in the codebase rg "AbortSignal|statement_timeout|timeout" --type ts --type tsx apps/api | head -20

Repository: superset-sh/superset

Length of output: 92

🏁 Script executed:

# Search for redis usage patterns more broadly rg "redis\." apps/api --type ts -B 1 -A 1 | head -30

Repository: superset-sh/superset

Length of output: 298

🏁 Script executed:

# Check Drizzle documentation patterns and db.execute in codebase rg "db\.execute" apps/api --type ts -B 2 -A 2

Repository: superset-sh/superset

Length of output: 438

🏁 Script executed:

# Look for any timeout patterns in API routes rg "timeout|AbortController|AbortSignal" apps/api --type ts

Repository: superset-sh/superset

Length of output: 46

🏁 Script executed:

# Check Drizzle db client configuration in the codebase find . -name "*db*" -type f -path "*/packages/db/*" | head -10

Repository: superset-sh/superset

Length of output: 119

🏁 Script executed:

# Look for db client initialization rg "createPool|postgres\(" --type ts -B 2 -A 5 | head -40

Repository: superset-sh/superset

Length of output: 46

🏁 Script executed:

# Find the db client file find . -path "*/db/client*" -type f

Repository: superset-sh/superset

Length of output: 46

🏁 Script executed:

# Find db package structure ls -la packages/db/

Repository: superset-sh/superset

Length of output: 485

🏁 Script executed:

# Check for database configuration rg "new Database|new Pool|postgres\(" packages/ --type ts -B 1 -A 3 | head -50

Repository: superset-sh/superset

Length of output: 1693

🏁 Script executed:

# Read the db client configuration cat packages/db/src/client.ts

Repository: superset-sh/superset

Length of output: 619

🏁 Script executed:

# Check drizzle-orm types for execute method rg "execute.*function\|execute.*Promise" packages/ --type ts -B 2 -A 5 | head -50

Repository: superset-sh/superset

Length of output: 46

🏁 Script executed:

# Check how db is actually imported in the sync-presence route head -10 apps/api/src/app/api/hosts/jobs/sync-presence/route.ts

Repository: superset-sh/superset

Length of output: 329

🏁 Script executed:

# Check if there are any other examples of timeout handling in the codebase rg "controller.*abort\|AbortController" . --type ts -B 2 -A 2

Repository: superset-sh/superset

Length of output: 46

🏁 Script executed:

# Verify what db import path resolves to and check for timeout patterns rg "@superset/db" apps/ --type ts -B 1 -A 1 | head -20

Repository: superset-sh/superset

Length of output: 1378

Add explicit timeouts to the Redis and DB calls with correct per-driver approaches.

QStash will retry on non-2xx, but a hung request can sit on a Node handler indefinitely and tie up retries behind it. Neither redis.zrange nor db.execute here has a deadline. However, the implementation approach depends on the drivers:

Redis (@upstash/redis 1.37.0): Does not support AbortSignal in zrange options. Instead, pass signal to the Redis client constructor: new Redis({ ..., signal: AbortSignal.timeout(5000) }). This applies to all commands.

Database (Neon serverless + Drizzle 0.45.2): The Neon serverless driver is HTTP-based. Use SQL-level statement timeouts within the query or wrap with AbortSignal.timeout() on the db.execute() call itself.

A bounded timeout keeps the reconciler self-healing under upstream slowness.

Also applies to: 80-105

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 47 - 58, Add explicit per-driver timeouts: recreate or ensure the Redis client used by this route is constructed with a signal (e.g. new Redis({ ..., signal: AbortSignal.timeout(5000) })) so all redis.zrange calls have a 5s deadline, and replace any existing redis client usage in this file (including the other call sites around lines 80-105) to use that client; for DB calls (db.execute in this file) wrap the execute call with an AbortSignal timeout or add a SQL statement timeout so the query has a bounded deadline (e.g. call db.execute within AbortSignal.timeout(...) or include a query-level statement_timeout), and ensure the catch blocks still handle and log AbortErrors consistently.

coderabbitai · 2026-05-13T01:01:55Z

+		connected = await redis.zrange<string[]>(
+			RELAY_TTL_KEY,
+			Date.now(),
+			"+inf",
+			{ byScore: true },
+		);


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify the relay's directory contract matches this consumer. fd -t f 'directory\.(ts|js)$' apps/relay rg -nP -C3 'tunnel-ttl|relay:tunnel' apps/relay # Look at how scores are computed and how members are formatted. rg -nP -C5 '\b(zadd|ZADD|expireAt|score)\b' apps/relay rg -nP -C3 'machine[_-]?id.*organization|organization.*machine' apps/relay/src

Repository: superset-sh/superset

Length of output: 4490

Critical mismatch: relay writes hostId members, but SQL expects organization_id:machine_id format.

The relay stores ZSET members as bare hostId values (line 111 in apps/relay/src/directory.ts):

redis.zadd(TTL_KEY, { score: now + TTL_GRACE_MS, member: hostId })

This handler retrieves that array into connected and then attempts to match it against a SQL WHERE clause that builds identifiers as organization_id || ':' || machine_id. Since the formats do not match, every row will have expected = false and all connected hosts will be flipped offline silently.

Fix: Either

Have the relay write members as ${organizationId}:${hostId} (requires changes to apps/relay/src/directory.ts register and sweep), or

Parse the hostId from connected[] to look up the corresponding organization_id in a separate index before comparing, or

Use the tunnel-owner HSET to reverse-map hostId → machineId and reconstruct the compound key.

The score units (milliseconds) and key name (relay:tunnel-ttl) are correct.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 49 - 54, The handler reads bare hostId values into connected from RELAY_TTL_KEY but later compares against SQL rows built as organization_id:machine_id, so expected will always be false; fix by reverse-mapping each hostId using the tunnel-owner HSET (the "tunnel-owner" key) to obtain the corresponding machine_id/organization_id pair and reconstruct the compound key before comparison (use redis.hget/hmget on the tunnel-owner mapping for the hostId(s) retrieved into connected and build `${organization_id}:${machine_id}` to compare with the SQL identifier), or alternatively change relay's zadd in apps/relay/src/directory.ts to store `${organizationId}:${hostId}` as the member so the formats match—implement one of these two fixes and keep RELAY_TTL_KEY/connected usage consistent with the chosen mapping approach.

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts">

<violation number="1" location="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:63">
P1: Skipping reconciliation when `connected` is empty breaks the legitimate “all hosts offline” case and can leave stale `is_online=true` rows unhealed.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-05-13T01:02:48Z

+	if (connected.length === 0) {
+		console.warn(
+			"[sync-presence] empty connected set; skipping reconcile to avoid mass-flip",
+		);
+		return Response.json({
+			connected: 0,
+			flippedOn: 0,
+			flippedOff: 0,
+			skipped: true,
+		});
+	}


P1: Skipping reconciliation when connected is empty breaks the legitimate “all hosts offline” case and can leave stale is_online=true rows unhealed.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/app/api/hosts/jobs/sync-presence/route.ts, line 63: <comment>Skipping reconciliation when `connected` is empty breaks the legitimate “all hosts offline” case and can leave stale `is_online=true` rows unhealed.</comment> <file context> @@ -54,29 +57,55 @@ export async function POST(request: Request): Promise<Response> { + // Refuse to mass-flip when the directory comes back empty — most likely a + // misconfigured KV credential or a wiped key, not a real zero-host state. + // The relay's event-driven setOnline writes still cover genuine disconnects. + if (connected.length === 0) { + console.warn( + "[sync-presence] empty connected set; skipping reconcile to avoid mass-flip", </file context>

Suggested change

if (connected.length === 0) {

console.warn(

"[sync-presence] empty connected set; skipping reconcile to avoid mass-flip",

);

return Response.json({

connected: 0,

flippedOn: 0,

flippedOff: 0,

skipped: true,

});

}

// Allow empty directory snapshots so reconciliation can flip all hosts offline when none are connected.

Changes since v0.2.15: - workspaces: `superset workspaces list` table now shows the workspace ID column. (#4463) - docs: VHS demo walkthrough recording added under `packages/cli/demo/`. (#4461) - auth: `superset auth login --api-key sk_live_…` stores an API key at `~/.superset/config.json` instead of running the OAuth flow; `whoami` and `logout` updated to recognize stored keys. (#4472) - hosts: `superset hosts list` shows `local` for the host machine you're invoking from, distinct from `online`/`no`. (#4476) - automations: `--agent` is now the host agent instance/preset id (e.g. `claude`, `codex`, `superset`) and is resolved live from the host on every run. The `--agent-config-file` flag is removed; create/update rename `agentConfig` -> `agent` end-to-end. (#4481) Push cli-v0.2.16 after this lands to fire the release pipeline.

saddlepaddle temporarily deployed to preview May 13, 2026 00:18 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 13, 2026 00:19 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 13, 2026

View reviewed changes

Comment thread apps/api/src/app/api/hosts/jobs/sync-presence/route.ts Outdated

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

saddlepaddle temporarily deployed to preview May 13, 2026 00:56 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 13, 2026 00:57 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 13, 2026

View reviewed changes

saddlepaddle merged commit 9f67fb5 into main May 13, 2026
17 checks passed

This was referenced May 13, 2026

fix(api): pass connected set as array literal in sync-presence #4483

Merged

release(cli): cut v0.2.16 #4494

Merged

Conversation

saddlepaddle commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Approach

Why direct-Upstash instead of a new relay endpoint

Other changes in this PR

Test plan

Out of scope

Summary by cubic

Summary by CodeRabbit

Uh oh!

capy-ai Bot commented May 13, 2026

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented May 13, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🧹 Preview Cleanup Complete

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

saddlepaddle commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading

github-actions Bot commented May 13, 2026 •

edited

Loading