
feat(api): reconcile host online status from relay directory#4476

Merged
saddlepaddle merged 2 commits into main from host-device-relay-status
May 13, 2026

@saddlepaddle
Collaborator

@saddlepaddle saddlepaddle commented May 13, 2026

Summary

Adds a safety net for stale v2_hosts.is_online rows. Today the relay writes this boolean on an event-driven basis by calling host.setOnline on tunnel open/close (up to 3 retries, after which the write is silently dropped). When the relay crashes or host.setOnline exhausts its retries, the DB stays at is_online = true indefinitely with no self-healing. Automation dispatch (packages/trpc/src/router/automation/dispatch.ts), the desktop popovers, port scans, PR target derivation, and host badges all gate behavior on this boolean, so agents route work at unreachable hosts and get confused. That is the user-visible bug this PR addresses.

Approach

The relay already maintains the ground-truth connected set in Upstash (relay:tunnel-ttl sorted set, refreshed on every pong, evicted by sweepStale after a 90s grace window — see apps/relay/src/directory.ts). The new reconciler is a QStash-triggered route at POST /api/hosts/jobs/sync-presence that:

  1. Reads relay:tunnel-ttl directly from the same Upstash the relay uses (API already has the credentials).
  2. Diffs against v2_hosts.is_online in a single CTE-driven UPDATE ... IS DISTINCT FROM using ANY($1::text[]) for membership.
  3. Returns counts of flippedOn / flippedOff for observability.

A QStash schedule fires this every minute. Combined with the relay's existing 90s TTL + 30s sweep, a stuck row corrects within ~3 minutes worst case.
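As a rough check on the ~3 minute figure, the three intervals named above can be added together. This is one reading of the worst case, assuming the delays stack end to end; the constant names are illustrative and do not come from the codebase.

```typescript
// Worst-case convergence estimate, assuming the three delays stack end to end.
// Values come from the PR description; the names are hypothetical.
const RELAY_TTL_GRACE_S = 90; // relay tolerates a silent tunnel for up to 90s
const RELAY_SWEEP_INTERVAL_S = 30; // assumed: the sweep may lag expiry by up to 30s
const RECONCILE_INTERVAL_S = 60; // QStash schedule fires once a minute

const worstCaseSeconds =
  RELAY_TTL_GRACE_S + RELAY_SWEEP_INTERVAL_S + RECONCILE_INTERVAL_S;

console.log(`${worstCaseSeconds}s, about ${worstCaseSeconds / 60} minutes`);
```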

The existing event-driven host.setOnline write stays in place — it handles the common-case sub-second transition. The reconciler only catches the failure modes (relay crash, API write failures).
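The diff step can be pictured with an in-memory analogue. This is a minimal sketch, not the route's actual code: the `HostRow` shape, `routingKey`, and `diffPresence` are hypothetical names, and the `organizationId:machineId` member format is assumed from the SQL in this PR.

```typescript
// In-memory analogue of the reconcile diff. All names here are illustrative.
interface HostRow {
  organizationId: string;
  machineId: string;
  isOnline: boolean;
}

interface ReconcileCounts {
  connected: number;
  flippedOn: number;
  flippedOff: number;
}

// Assumed member format of relay:tunnel-ttl: "<organizationId>:<machineId>".
function routingKey(row: HostRow): string {
  return `${row.organizationId}:${row.machineId}`;
}

function diffPresence(rows: HostRow[], connected: string[]): ReconcileCounts {
  const live = new Set(connected);
  let flippedOn = 0;
  let flippedOff = 0;
  for (const row of rows) {
    const expected = live.has(routingKey(row));
    if (row.isOnline === expected) continue; // no drift, so no write
    if (expected) flippedOn += 1;
    else flippedOff += 1;
  }
  return { connected: live.size, flippedOn, flippedOff };
}

const sampleRows: HostRow[] = [
  { organizationId: "org1", machineId: "a", isOnline: true }, // connected, stays on
  { organizationId: "org1", machineId: "b", isOnline: true }, // stuck online
  { organizationId: "org1", machineId: "c", isOnline: false }, // missed "on" event
  { organizationId: "org2", machineId: "d", isOnline: false }, // offline, stays off
];

const counts = diffPresence(sampleRows, ["org1:a", "org1:c"]);
console.log(counts);
```

Only rows whose stored value differs from the expected value are counted, which mirrors the role of IS DISTINCT FROM in the UPDATE.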

Why direct-Upstash instead of a new relay endpoint

Earlier in the design we considered adding a GET /internal/presence endpoint to the relay with shared-secret auth. We dropped it because:

  • The API already has matching KV_REST_API_URL / KV_REST_API_TOKEN env vars (used by apps/api/src/app/api/chat/tools/web-search/route.ts for rate limiting).
  • One fewer network hop, one fewer env var to manage across two services, no shared secret to provision.
  • The coupling cost (knowing the key shape relay:tunnel-ttl) is small and documented inline.

Other changes in this PR

  • packages/cli/src/commands/hosts/list/command.ts — relabels offline rows as local when row.id matches getHostId(). Gives users running the daemon on the same machine a clearer "running but not relay-connected" signal versus a generic no. Other clients still see no for that host, which is the correct answer for them — they genuinely can't reach it.
  • apps/relay/fly.staging.toml — config for a parallel staging Fly app (superset-relay-staging). Already deployed, targeted at individual users via the existing relay-url-override PostHog flag (CSP allows https://relay-backup.superset.sh after feat(desktop): allow relay-backup.superset.sh in renderer CSP #4473). Documents the staging setup so future test infra is reproducible.
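The CLI relabeling described above amounts to a small three-way decision. The sketch below is hypothetical: `HostListRow`, `onlineLabel`, and the "yes" label for online hosts are assumptions for illustration, and `localHostId` stands in for whatever getHostId() returns.

```typescript
// Hypothetical sketch of the `hosts list` online column. Names are illustrative.
interface HostListRow {
  id: string;
  isOnline: boolean;
}

type OnlineLabel = "yes" | "local" | "no";

function onlineLabel(row: HostListRow, localHostId: string): OnlineLabel {
  if (row.isOnline) return "yes"; // assumed label for relay-connected hosts
  // Offline from the relay's perspective, but this is the machine we run on.
  return row.id === localHostId ? "local" : "no";
}

const localId = "machine-123"; // stand-in for the value returned by getHostId()
console.log(onlineLabel({ id: "machine-123", isOnline: false }, localId));
console.log(onlineLabel({ id: "machine-456", isOnline: false }, localId));
```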

Test plan

  • Staging relay deployed (superset-relay-staging.fly.dev), sandbox host registered there via the PostHog relay-url-override flag.
  • After merge, watch the first QStash run via the Upstash dashboard or Vercel logs — confirm 2xx response and reasonable connected count.
  • Force-kill the staging relay machine to manufacture stuck-online entries; confirm next reconciler run flips them to false (flippedOff > 0 in the response).
  • Inspect a sample of v2_hosts rows for known-offline hosts to confirm they converge.

Out of scope

  • No schema changes. v2_hosts.is_online stays a boolean; the reconciler keeps the value honest.
  • No Electric/desktop reactive changes. Reconciler updates flow through the normal Electric WAL fan-out, whose volume is already acceptable at scale because drift corrections are rare in steady state.
  • All consumers (dispatch.ts, popovers, port scans, badges) keep reading is_online unchanged.

Summary by cubic

Adds a scheduled reconciler to sync host online status with the relay’s Upstash directory, fixing stuck-online rows and adding guardrails to avoid mass flips. This self-heals drift so routing, automation, and UI stop targeting unreachable machines.

  • Bug Fixes

    • QStash-triggered POST /api/hosts/jobs/sync-presence reads Upstash relay:tunnel-ttl and diffs into v2_hosts.is_online; returns connected, flippedOn, flippedOff.
    • Reliability: skip writes when the directory is empty to avoid mass-offline; wrap the reconcile UPDATE in try/catch and log; log on signature-verify failures.
    • Runs every minute; with the relay’s 90s TTL + sweep, drift corrects in ~3 minutes.
    • Keeps event-driven host.setOnline as primary; this job repairs relay/API failure cases.
  • New Features

    • CLI hosts list: offline rows show local when row.id matches getHostId().
    • Added apps/relay/fly.staging.toml for a staging relay app (superset-relay-staging) used via the relay-url-override flag.

Written for commit c4ad013. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Host presence sync endpoint to reconcile online/offline status with connected tunnels, including signature verification and safety checks to avoid mass-flips.
    • CLI hosts list now marks the current machine as "local" instead of "no".
  • Chores

    • Added staging deployment configuration for the relay service.

Review Change Stack

Adds a QStash-triggered route at /api/hosts/jobs/sync-presence that
reads the relay's Upstash tunnel-TTL sorted set and diffs against
v2_hosts.is_online, flipping any rows where DB state has drifted from
the actual connected set.

Existing behavior — relay calls host.setOnline on tunnel open/close —
remains the primary path; this is the safety net for relay crashes
and host.setOnline retry-exhaustion, both of which left rows stuck at
is_online=true with no self-healing mechanism. Reconciler is scheduled
every minute via QStash.

Reads relay:tunnel-ttl directly from Upstash rather than hopping
through a new relay HTTP endpoint — the API already has the same KV
credentials.

Also:
- cli `hosts list` relabels disconnected rows as "local" when the row
  matches the local machine via getHostId() — gives users running the
  daemon on the same box a clearer "running but not relay-connected"
  signal versus a generic "no".
- apps/relay/fly.staging.toml: config for a parallel staging Fly app
  (superset-relay-staging), targeted at individual users via the
  existing relay-url-override PostHog flag for E2E testing.
@capy-ai

capy-ai Bot commented May 13, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

Adds a POST API endpoint that reconciles v2_hosts.is_online from connected tunnel entries in Upstash Redis, a Fly staging config for the relay, and a CLI change that marks the local host as "local" in hosts list output.

Changes

Host Presence Sync Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Sync presence API endpoint (apps/api/src/app/api/hosts/jobs/sync-presence/route.ts) | POST handler enforces QStash signature verification (non-dev), reads connected tunnel IDs from Upstash Redis with score-based range filtering, early-exits when the connected set is empty, updates v2_hosts.is_online via a CTE-based bulk UPDATE that only flips differing rows, and returns connected plus flipped-on/flipped-off counts. |
| Relay staging deployment configuration (apps/relay/fly.staging.toml) | Fly.io staging config sets app name and region, specifies Dockerfile build source, sets RELAY_PORT, enforces HTTPS with an HTTP health check at /health, limits connection concurrency, and provisions VM resources (4GB memory, performance CPU, 2 CPUs). |

CLI Local Host Display

| Layer / File(s) | Summary |
| --- | --- |
| Local host identification in list output (packages/cli/src/commands/hosts/list/command.ts) | Import getHostId and update the online column to return "local" when the host ID matches the current local host, otherwise show "no" for offline hosts. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I twitch my whiskers at the logs,
Redis hums the tunnel song,
Local host glows on the list,
Staging wings prepare to throng—
A rabbit cheers: sync done, hop along!" 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(api): reconcile host online status from relay directory' accurately and concisely summarizes the main change: adding a reconciler route to sync host online status from the relay's Upstash directory. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, covering motivation (stale is_online), approach (QStash-triggered reconciler reading Upstash), design rationale, test plan, and scope clearly, with all template sections addressed. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts (1)

36-37: ⚡ Quick win

Use request.url for QStash signature verification to prevent config drift.

The JWT 'sub' claim must match the actual endpoint URL. Using request.url ensures verification against the real incoming request rather than a potentially mismatched environment variable, especially across different deployment environments or behind proxies.

Proposed fix
-				url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`,
+				url: request.url,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 36 - 37,
The QStash signature verification is currently using env.NEXT_PUBLIC_API_URL to
build the expected request URL which can drift from the actual incoming URL;
update the verification logic in the route handler that checks the JWT 'sub'
claim so it uses request.url instead of env.NEXT_PUBLIC_API_URL (i.e., replace
uses of env.NEXT_PUBLIC_API_URL when constructing the expected 'sub' URL in the
QStash signature verification code inside the sync-presence route) so the JWT
'sub' matches the real incoming request URL every time.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 981fe690-ecdf-4f49-b020-0141a9f37bec

📥 Commits

Reviewing files that changed from the base of the PR and between d5f83b9 and 1dc295a.

📒 Files selected for processing (3)
  • apps/api/src/app/api/hosts/jobs/sync-presence/route.ts
  • apps/relay/fly.staging.toml
  • packages/cli/src/commands/hosts/list/command.ts

@greptile-apps
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR adds a QStash-scheduled presence reconciler (POST /api/hosts/jobs/sync-presence) that reads the relay's live-host sorted set from Upstash and bulk-updates v2_hosts.is_online for any rows that have drifted. It also adds a staging relay Fly config and a CLI improvement that labels the local machine's host as local rather than no when it is offline from the relay's perspective.

  • sync-presence/route.ts: New cron-driven route that diffs relay:tunnel-ttl membership (scores ≥ Date.now()) against DB rows using a single CTE + IS DISTINCT FROM update; the key format (organizationId:machineId) correctly matches buildHostRoutingKey from @superset/shared/host-routing, and QStash signature verification is in place for production.
  • fly.staging.toml: Fly app config for a parallel staging relay instance, mirroring production concurrency/health-check settings.
  • packages/cli/src/commands/hosts/list/command.ts: Compares row.id (the machineId HMAC) against getHostId() to surface a local status; the identifier types align correctly.

Confidence Score: 3/5

The reconciler logic and key format are correct, but a DB failure during the update throws an unhandled exception with no log output, and a missing or wrong Redis key would silently flip every host offline.

The core logic is sound and the key format aligns with buildHostRoutingKey. However, the db.execute block lacks error handling, and redis.zrange on a missing key returns [] without error, so a misconfigured KV URL would cause a bulk offline flip of every host with no observability signal.

apps/api/src/app/api/hosts/jobs/sync-presence/route.ts needs a try-catch around db.execute and a guard for the empty-connected-array case before the route is safe to run every minute in production.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/api/src/app/api/hosts/jobs/sync-presence/route.ts | New QStash-triggered reconciler that diffs Upstash sorted-set membership against v2_hosts.is_online; Redis read is guarded but the DB update has no error handling, and an absent key silently bulk-flips all hosts offline. |
| apps/relay/fly.staging.toml | Staging Fly app config for superset-relay-staging; mirrors the production relay shape with identical health-check and concurrency settings. |
| packages/cli/src/commands/hosts/list/command.ts | Adds a 'local' label for the current machine's host entry when is_online is false; compares row.id (machineId HMAC) against getHostId() which is the correct identifier. |

Sequence Diagram

sequenceDiagram
    participant QS as QStash Scheduler
    participant API as POST /api/hosts/jobs/sync-presence
    participant UP as Upstash Redis
    participant DB as PostgreSQL (v2_hosts)

    QS->>API: POST (with upstash-signature)
    API->>API: Verify QStash signature
    API->>UP: ZRANGEBYSCORE relay:tunnel-ttl Date.now() +inf
    UP-->>API: connected[] (organizationId:machineId members)
    API->>DB: "UPDATE v2_hosts SET is_online = (id IN connected) WHERE is_online IS DISTINCT FROM expected"
    DB-->>API: flipped rows
    API-->>QS: "200 { connected, flippedOn, flippedOff }"
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:57-76
**Unhandled DB failure crashes the reconciler**

The Redis call above has a `try/catch` that returns a `502`, but the `db.execute` block has none. A transient DB connection error (or a Postgres type error if `connected` somehow produces a malformed `ANY` cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no `flippedOn`/`flippedOff` counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.

### Issue 2 of 2
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:44-55
**Empty `connected` set silently marks every host offline**

`redis.zrange` on a missing key returns `[]` without throwing — so if `relay:tunnel-ttl` is absent (wrong `KV_REST_API_URL` secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the `UPDATE` query flips every host to `is_online = false`. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when `connected.length === 0`, or adding a sanity threshold before committing the write.

Reviews (1): Last reviewed commit: "feat(api): reconcile host online status ..." | Re-trigger Greptile

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts">

<violation number="1" location="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:38">
P2: Do not silently swallow signature verification errors; log the caught error before returning `false` so real verification failures are observable.

(Based on your team's feedback about surfacing async failures instead of empty catches.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread apps/api/src/app/api/hosts/jobs/sync-presence/route.ts Outdated
Comment on lines +57 to +76
const result = await db.execute<{
  organization_id: string;
  machine_id: string;
  is_online: boolean;
}>(sql`
  WITH desired AS (
    SELECT
      organization_id,
      machine_id,
      (organization_id::text || ':' || machine_id) = ANY(${connected}::text[]) AS expected
    FROM v2_hosts
  )
  UPDATE v2_hosts h
  SET is_online = d.expected
  FROM desired d
  WHERE h.organization_id = d.organization_id
    AND h.machine_id = d.machine_id
    AND h.is_online IS DISTINCT FROM d.expected
  RETURNING h.organization_id, h.machine_id, h.is_online
`);

P1 Unhandled DB failure crashes the reconciler

The Redis call above has a try/catch that returns a 502, but the db.execute block has none. A transient DB connection error (or a Postgres type error if connected somehow produces a malformed ANY cast) will throw an unhandled exception — causing Next.js to return a generic 500 with no log message and no flippedOn/flippedOff counts. QStash will retry into the same silent failure, making this very hard to diagnose. It should be wrapped the same way the Redis read is.


Comment on lines +44 to +55
let connected: string[];
try {
  connected = await redis.zrange<string[]>(
    RELAY_TTL_KEY,
    Date.now(),
    "+inf",
    { byScore: true },
  );
} catch (error) {
  console.error("[sync-presence] redis read failed:", error);
  return Response.json({ error: "Directory read failed" }, { status: 502 });
}

P2 Empty connected set silently marks every host offline

redis.zrange on a missing key returns [] without throwing — so if relay:tunnel-ttl is absent (wrong KV_REST_API_URL secret, fresh deployment, accidental key deletion), the reconciler successfully receives an empty array and the UPDATE query flips every host to is_online = false. There's no way to distinguish "no hosts are genuinely connected" from "wrong Redis instance." Consider logging a warning or returning early when connected.length === 0, or adding a sanity threshold before committing the write.


@github-actions
Contributor

github-actions Bot commented May 13, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

- Wrap the reconcile UPDATE in try/catch and log on failure. Without
  this, a transient DB error throws into Next.js as an opaque 500,
  QStash silently retries into the same failure, no flippedOn/flippedOff
  to anchor diagnosis.
- Refuse to write when the Upstash directory returns an empty connected
  set. An absent or misrouted relay:tunnel-ttl key would otherwise
  silently flip every host to is_online=false. The relay's event-driven
  setOnline path still handles genuine disconnects, so skipping here is
  safe.
- Log on the QStash signature verify catch instead of swallowing the
  error.
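The three guardrails in this commit can be sketched as a single control flow with the Redis read and the DB write injected as functions. This is an illustration under assumptions, not the route's code: `syncPresence`, `Deps`, `JobResponse`, the `skipped` field, and the 200/500 status codes for the empty-set and DB-failure cases are all hypothetical; only the 502 on a failed directory read appears in the reviewed code.

```typescript
// Hypothetical control-flow sketch of the hardened reconciler.
type JobResponse = { status: number; body: Record<string, unknown> };

interface Deps {
  // Stand-in for the Upstash read of relay:tunnel-ttl.
  readConnected: () => Promise<string[]>;
  // Stand-in for the CTE-driven UPDATE; returns flip counts.
  applyDiff: (
    connected: string[],
  ) => Promise<{ flippedOn: number; flippedOff: number }>;
}

async function syncPresence(deps: Deps): Promise<JobResponse> {
  let connected: string[];
  try {
    connected = await deps.readConnected();
  } catch (error) {
    console.error("[sync-presence] redis read failed:", error);
    return { status: 502, body: { error: "Directory read failed" } };
  }

  // An empty set may mean "wrong Redis instance" rather than "nobody is
  // connected", so refuse to write instead of flipping every host offline.
  if (connected.length === 0) {
    console.warn("[sync-presence] empty directory, skipping write");
    return { status: 200, body: { connected: 0, skipped: true } };
  }

  try {
    const { flippedOn, flippedOff } = await deps.applyDiff(connected);
    return {
      status: 200,
      body: { connected: connected.length, flippedOn, flippedOff },
    };
  } catch (error) {
    console.error("[sync-presence] reconcile update failed:", error);
    return { status: 500, body: { error: "Reconcile failed" } };
  }
}
```

Injecting the two side effects keeps the guard logic testable without Redis or Postgres; a caller could pass fakes that return an empty array or throw.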
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
apps/api/src/app/api/hosts/jobs/sync-presence/route.ts (2)

86-100: ⚖️ Poor tradeoff

Reconciliation does a full v2_hosts scan every minute.

The CTE SELECT … FROM v2_hosts materialises every row to compute expected, regardless of how many are actually misaligned. For a small fleet this is fine, but at scale (100k+ hosts × 1 run/min) it becomes a recurring full-table read on a hot OLTP table. Two cheap improvements you might consider before this becomes a problem:

  • Restrict the CTE to candidate rows only: WHERE is_online = true OR (organization_id::text || ':' || machine_id) = ANY(...) — that limits the scan to currently-online hosts plus the connected set, which is the actual diff surface.
  • Composite index on v2_hosts (organization_id, machine_id) if one doesn't already exist (the join in the UPDATE depends on it being indexed for the planner to avoid a second seq scan).

Worth flagging now since the QStash schedule is fixed at 1/min; if you expect host count to grow an order of magnitude, this is the path that gets expensive first.
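The reason the first suggestion is safe can be checked with an in-memory analogue: a row that is neither currently online nor in the connected set already has the expected value (false), so excluding it cannot change the diff. The sketch below uses hypothetical names (`Host`, `keyOf`, `candidates`, `drifted`) and plain arrays in place of the table.

```typescript
// In-memory check that restricting to candidate rows preserves the diff.
interface Host {
  organizationId: string;
  machineId: string;
  isOnline: boolean;
}

const keyOf = (h: Host): string => `${h.organizationId}:${h.machineId}`;

// Rows that could need a flip: currently online, or present in the connected set.
function candidates(rows: Host[], live: Set<string>): Host[] {
  return rows.filter((h) => h.isOnline || live.has(keyOf(h)));
}

// Keys of rows whose stored value differs from the expected value.
function drifted(rows: Host[], live: Set<string>): string[] {
  return rows
    .filter((h) => h.isOnline !== live.has(keyOf(h)))
    .map(keyOf)
    .sort();
}

const fleet: Host[] = [
  { organizationId: "org1", machineId: "a", isOnline: true },
  { organizationId: "org1", machineId: "b", isOnline: true },
  { organizationId: "org1", machineId: "c", isOnline: false },
  { organizationId: "org2", machineId: "d", isOnline: false },
  { organizationId: "org2", machineId: "e", isOnline: false },
];
const liveSet = new Set(["org1:a", "org1:c"]);

const fullScan = drifted(fleet, liveSet);
const restricted = drifted(candidates(fleet, liveSet), liveSet);
console.log(fullScan, restricted, candidates(fleet, liveSet).length);
```

Here the restricted pass examines three rows instead of five and reports the same drifted keys; whether the planner benefits in Postgres depends on the indexes available, which this sketch does not model.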

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 86 -
100, The CTE named desired scans all rows in v2_hosts each run; restrict it to
the actual candidate set by adding a WHERE clause to the CTE such as WHERE
is_online = true OR (organization_id::text || ':' || machine_id) =
ANY(${connected}::text[]) so only currently-online hosts and the connected set
are materialized before the UPDATE, and ensure there is a composite index on
(organization_id, machine_id) (create an index if none exists) so the UPDATE ...
FROM desired join on organization_id and machine_id uses the index rather than
triggering a second seq scan.

32-37: 💤 Low value

QStash signature verification URL vulnerable to NEXT_PUBLIC_API_URL formatting with trailing slashes.

QStash signs the exact URL it POSTs to. If NEXT_PUBLIC_API_URL is configured with a trailing slash (e.g., https://api.example.com/), the constructed URL becomes malformed (https://…//api/hosts/jobs/sync-presence) and signature verification fails silently.

The env validation (z.string().url()) accepts but does not normalize trailing slashes. A normalization pattern already exists elsewhere in the codebase (apps/api/src/trpc/context.ts line 9). Consider normalizing NEXT_PUBLIC_API_URL at the env level or at the verification call site:

♻️ Defensive normalisation
+const API_BASE = env.NEXT_PUBLIC_API_URL.replace(/\/+$/, "");
 ...
-      url: `${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`,
+      url: `${API_BASE}/api/hosts/jobs/sync-presence`,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 32 - 37,
The QStash signature verification uses receiver.verify with url:
`${env.NEXT_PUBLIC_API_URL}/api/hosts/jobs/sync-presence`, which breaks if
NEXT_PUBLIC_API_URL has a trailing slash; normalize the base URL before
verification (either at env parsing or immediately before calling
receiver.verify) by removing any trailing slash (e.g., use a trimEnd('/')
equivalent) and then build the full path, ensuring receiver.verify receives the
exact URL QStash signed; update references in route.ts around receiver.verify
and/or the env parsing to use the normalized NEXT_PUBLIC_API_URL.
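The normalization suggested above can be isolated into a small helper. This is a minimal sketch; `buildVerifyUrl` is a hypothetical name, and the example base URL is a placeholder rather than the project's real API origin.

```typescript
// Hypothetical helper: join a base URL and a path without doubling slashes.
function buildVerifyUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop any trailing slashes
  const normalizedPath = path.startsWith("/") ? path : `/${path}`;
  return `${trimmedBase}${normalizedPath}`;
}

const SYNC_PATH = "/api/hosts/jobs/sync-presence";

console.log(buildVerifyUrl("https://api.example.com", SYNC_PATH));
console.log(buildVerifyUrl("https://api.example.com/", SYNC_PATH));
```

Both calls yield the same string, so the URL handed to signature verification no longer depends on how the environment variable happens to be written.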
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts`:
- Around line 49-54: The handler reads bare hostId values into connected from
RELAY_TTL_KEY but later compares against SQL rows built as
organization_id:machine_id, so expected will always be false; fix by
reverse-mapping each hostId using the tunnel-owner HSET (the "tunnel-owner" key)
to obtain the corresponding machine_id/organization_id pair and reconstruct the
compound key before comparison (use redis.hget/hmget on the tunnel-owner mapping
for the hostId(s) retrieved into connected and build
`${organization_id}:${machine_id}` to compare with the SQL identifier), or
alternatively change relay's zadd in apps/relay/src/directory.ts to store
`${organizationId}:${hostId}` as the member so the formats match—implement one
of these two fixes and keep RELAY_TTL_KEY/connected usage consistent with the
chosen mapping approach.
- Around line 47-58: Add explicit per-driver timeouts: recreate or ensure the
Redis client used by this route is constructed with a signal (e.g. new Redis({
..., signal: AbortSignal.timeout(5000) })) so all redis.zrange calls have a 5s
deadline, and replace any existing redis client usage in this file (including
the other call sites around lines 80-105) to use that client; for DB calls
(db.execute in this file) wrap the execute call with an AbortSignal timeout or
add a SQL statement timeout so the query has a bounded deadline (e.g. call
db.execute within AbortSignal.timeout(...) or include a query-level
statement_timeout), and ensure the catch blocks still handle and log AbortErrors
consistently.
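For the timeout point, one driver-agnostic way to bound a call is to race it against a timer. This is a generic sketch, not the client-level `signal` approach the comment proposes, and `withDeadline` is a hypothetical helper. Note that it only stops waiting: the underlying request or query keeps running unless the driver is also given its own abort or statement timeout.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
  });
  // Whichever settles first wins; always clear the timer afterwards so a
  // fast result does not leave a pending timeout behind.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Usage: a call that resolves in time passes through unchanged.
withDeadline(Promise.resolve(["org1:a"]), 5000).then((members) => {
  console.log("connected members:", members.length);
});
```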


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 408bb6a9-dcf7-4fe2-be5f-38c453129cee

📥 Commits

Reviewing files that changed from the base of the PR and between 1dc295a and c4ad013.

📒 Files selected for processing (1)
  • apps/api/src/app/api/hosts/jobs/sync-presence/route.ts

Comment on lines +47 to +58
let connected: string[];
try {
  connected = await redis.zrange<string[]>(
    RELAY_TTL_KEY,
    Date.now(),
    "+inf",
    { byScore: true },
  );
} catch (error) {
  console.error("[sync-presence] redis read failed:", error);
  return Response.json({ error: "Directory read failed" }, { status: 502 });
}

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

@upstash/redis 1.37 zrange AbortSignal option

💡 Result:

@upstash/redis (upstash/redis-js) does not support an AbortSignal option on the ZRANGE command options (e.g., there is no zrange(..., { AbortSignal: ... }) or zrange(..., { signal: ... }) supported in the command’s option type). What it does support is passing an AbortSignal to the Redis client itself, via the client configuration. The Node.js platform config includes a signal option (“The signal will allow aborting requests on the fly”) [1]. Additionally, the underlying HTTP request layer for commands accepts signal?: AbortSignal [2], and the command execution passes signal: this.signal into the request [3].

How to use (example):

1) Create the client with a signal:

    const controller = new AbortController();
    const redis = new Redis({
      url: process.env.UPSTASH_REDIS_REST_URL!,
      token: process.env.UPSTASH_REST_TOKEN!,
      signal: controller.signal,
    });
    // cancel in-flight request
    controller.abort();

This pattern is shown directly in Upstash’s client configuration docs [4].

2) Or set a timeout signal (the example uses AbortSignal.timeout): Upstash’s advanced docs show configuring the SDK with signal: () => AbortSignal.timeout(1000) [5].

So for your exact query (“@upstash/redis 1.37 zrange AbortSignal option”): use the client-level signal, not a per-command zrange option. The zrange options in code include only ZRANGE-specific flags like withScores, rev, byScore, byLex, and pagination (offset/count) [6], and Upstash’s ZRANGE docs list only those ZRANGE arguments (with no AbortSignal mention) [7].

Citations:


🏁 Script executed:

# Check package.json for `@upstash/redis` and drizzle versions
find . -name "package.json" -type f | head -5 | xargs grep -l "@upstash/redis\|drizzle"

Repository: superset-sh/superset

Length of output: 177


🏁 Script executed:

# Read the complete route file to understand the implementation
cat -n apps/api/src/app/api/hosts/jobs/sync-presence/route.ts

Repository: superset-sh/superset

Length of output: 3984


🏁 Script executed:

# Check for Redis client initialization in the codebase
fd -e ts -e tsx | xargs grep -l "new Redis\|Redis({" | head -10

Repository: superset-sh/superset

Length of output: 279


🏁 Script executed:

# Check `@upstash/redis` and drizzle versions in package.json
cat apps/api/package.json | grep -A 2 -B 2 "@upstash/redis\|drizzle-orm"

Repository: superset-sh/superset

Length of output: 315


🏁 Script executed:

# Check how Redis client is used elsewhere in the codebase
rg "redis\.(zrange|execute)" --type ts --type tsx -B 2 -A 2

Repository: superset-sh/superset

Length of output: 92


🏁 Script executed:

# Look for any existing timeout patterns in the codebase
rg "AbortSignal|statement_timeout|timeout" --type ts --type tsx apps/api | head -20

Repository: superset-sh/superset

Length of output: 92


🏁 Script executed:

# Search for redis usage patterns more broadly
rg "redis\." apps/api --type ts -B 1 -A 1 | head -30

Repository: superset-sh/superset

Length of output: 298


🏁 Script executed:

# Check Drizzle documentation patterns and db.execute in codebase
rg "db\.execute" apps/api --type ts -B 2 -A 2

Repository: superset-sh/superset

Length of output: 438


🏁 Script executed:

# Look for any timeout patterns in API routes
rg "timeout|AbortController|AbortSignal" apps/api --type ts

Repository: superset-sh/superset

Length of output: 46


🏁 Script executed:

# Check Drizzle db client configuration in the codebase
find . -name "*db*" -type f -path "*/packages/db/*" | head -10

Repository: superset-sh/superset

Length of output: 119


🏁 Script executed:

# Look for db client initialization
rg "createPool|postgres\(" --type ts -B 2 -A 5 | head -40

Repository: superset-sh/superset

Length of output: 46


🏁 Script executed:

# Find the db client file
find . -path "*/db/client*" -type f

Repository: superset-sh/superset

Length of output: 46


🏁 Script executed:

# Find db package structure
ls -la packages/db/

Repository: superset-sh/superset

Length of output: 485


🏁 Script executed:

# Check for database configuration
rg "new Database|new Pool|postgres\(" packages/ --type ts -B 1 -A 3 | head -50

Repository: superset-sh/superset

Length of output: 1693


🏁 Script executed:

# Read the db client configuration
cat packages/db/src/client.ts

Repository: superset-sh/superset

Length of output: 619


🏁 Script executed:

# Check drizzle-orm types for execute method
rg "execute.*function\|execute.*Promise" packages/ --type ts -B 2 -A 5 | head -50

Repository: superset-sh/superset

Length of output: 46


🏁 Script executed:

# Check how db is actually imported in the sync-presence route
head -10 apps/api/src/app/api/hosts/jobs/sync-presence/route.ts

Repository: superset-sh/superset

Length of output: 329


🏁 Script executed:

# Check if there are any other examples of timeout handling in the codebase
rg "controller.*abort\|AbortController" . --type ts -B 2 -A 2

Repository: superset-sh/superset

Length of output: 46


🏁 Script executed:

# Verify what db import path resolves to and check for timeout patterns
rg "@superset/db" apps/ --type ts -B 1 -A 1 | head -20

Repository: superset-sh/superset

Length of output: 1378


Add explicit timeouts to the Redis and DB calls with correct per-driver approaches.

QStash will retry on non-2xx, but a hung request can sit on a Node handler indefinitely and tie up retries behind it. Neither redis.zrange nor db.execute here has a deadline. However, the implementation approach depends on the drivers:

  • Redis (@upstash/redis 1.37.0): Does not support AbortSignal in zrange options. Instead, pass signal to the Redis client constructor: new Redis({ ..., signal: AbortSignal.timeout(5000) }). This applies to all commands.
  • Database (Neon serverless + Drizzle 0.45.2): The Neon serverless driver is HTTP-based. Use SQL-level statement timeouts within the query or wrap with AbortSignal.timeout() on the db.execute() call itself.

A bounded timeout keeps the reconciler self-healing under upstream slowness.

Also applies to: 80-105
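
As an illustration of the bounded-deadline idea above, here is a minimal, driver-agnostic sketch. `withDeadline` and `DeadlineError` are hypothetical names, not part of the PR or of either driver's API. The helper only stops the handler from waiting; it does not cancel the underlying HTTP request, so a client-level signal or a SQL statement timeout would still be needed for real cancellation. Note also that a single `AbortSignal.timeout(...)` starts counting when it is created, so a per-call deadline needs a fresh signal or timer for each request, which is what the timer below provides.

```typescript
// Hypothetical helper (not from the PR): reject if `work` takes longer than `ms`.
class DeadlineError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} exceeded ${ms}ms deadline`);
    this.name = "DeadlineError";
  }
}

async function withDeadline<T>(
  label: string,
  ms: number,
  work: Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // A fresh timer per call, so every request gets its own deadline.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new DeadlineError(label, ms)), ms);
  });
  try {
    // Whichever settles first wins; a late result from `work` is ignored.
    return await Promise.race([work, deadline]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

In the route, this could wrap the existing call, for example `await withDeadline("redis zrange", 5000, redis.zrange(...))`, with the existing catch block logging the `DeadlineError` and returning the 502 as it does for other read failures.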

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 47 - 58,
Add explicit per-driver timeouts: recreate or ensure the Redis client used by
this route is constructed with a signal (e.g. new Redis({ ..., signal:
AbortSignal.timeout(5000) })) so all redis.zrange calls have a 5s deadline, and
replace any existing redis client usage in this file (including the other call
sites around lines 80-105) to use that client; for DB calls (db.execute in this
file) wrap the execute call with an AbortSignal timeout or add a SQL statement
timeout so the query has a bounded deadline (e.g. call db.execute within
AbortSignal.timeout(...) or include a query-level statement_timeout), and ensure
the catch blocks still handle and log AbortErrors consistently.

Comment on lines +49 to +54
connected = await redis.zrange<string[]>(
RELAY_TTL_KEY,
Date.now(),
"+inf",
{ byScore: true },
);
⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the relay's directory contract matches this consumer.
fd -t f 'directory\.(ts|js)$' apps/relay
rg -nP -C3 'tunnel-ttl|relay:tunnel' apps/relay
# Look at how scores are computed and how members are formatted.
rg -nP -C5 '\b(zadd|ZADD|expireAt|score)\b' apps/relay
rg -nP -C3 'machine[_-]?id.*organization|organization.*machine' apps/relay/src

Repository: superset-sh/superset

Length of output: 4490


Critical mismatch: relay writes hostId members, but SQL expects organization_id:machine_id format.

The relay stores ZSET members as bare hostId values (line 111 in apps/relay/src/directory.ts):

redis.zadd(TTL_KEY, { score: now + TTL_GRACE_MS, member: hostId })

This handler retrieves that array into connected and then attempts to match it against a SQL WHERE clause that builds identifiers as organization_id || ':' || machine_id. Since the formats do not match, every row will have expected = false and all connected hosts will be flipped offline silently.

Fix: Either

  1. Have the relay write members as ${organizationId}:${hostId} (requires changes to apps/relay/src/directory.ts register and sweep), or
  2. Parse the hostId from connected[] to look up the corresponding organization_id in a separate index before comparing, or
  3. Use the tunnel-owner HSET to reverse-map hostId → machineId and reconstruct the compound key.

The score units (milliseconds) and key name (relay:tunnel-ttl) are correct.
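
To make the claimed failure mode concrete, the following is a small in-memory sketch of the diff. It is not the route's SQL, and it only holds if the reviewer's reading of the two formats is right. The `HostRow` shape, `compoundKey`, `diffPresence`, and the sample ids are all assumptions made up for illustration.

```typescript
// Hypothetical row shape; field names mirror the columns named in the comment above.
type HostRow = { organizationId: string; machineId: string; isOnline: boolean };

// Mirrors the SQL identifier `organization_id || ':' || machine_id`.
const compoundKey = (row: HostRow): string =>
  `${row.organizationId}:${row.machineId}`;

// Count how many rows a reconcile pass would flip in each direction.
function diffPresence(rows: HostRow[], connected: string[]) {
  const live = new Set(connected);
  let flippedOn = 0;
  let flippedOff = 0;
  for (const row of rows) {
    const expected = live.has(compoundKey(row));
    if (expected && !row.isOnline) flippedOn++;
    if (!expected && row.isOnline) flippedOff++;
  }
  return { flippedOn, flippedOff };
}

const rows: HostRow[] = [
  { organizationId: "org_1", machineId: "m_a", isOnline: true },
  { organizationId: "org_1", machineId: "m_b", isOnline: false },
];

// Members already in the compound format: m_a stays online, m_b flips on.
const matching = diffPresence(rows, ["org_1:m_a", "org_1:m_b"]);

// Bare ids (stand-ins for the relay's hostId): nothing matches,
// so the one online row is flipped off and nothing is flipped on.
const mismatched = diffPresence(rows, ["m_a", "m_b"]);
```

Under these assumptions, a format mismatch shows up as `flippedOff` equal to the number of currently online rows and `flippedOn` stuck at zero, which is a pattern the returned counts would make visible.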

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/src/app/api/hosts/jobs/sync-presence/route.ts` around lines 49 - 54,
The handler reads bare hostId values into connected from RELAY_TTL_KEY but later
compares against SQL rows built as organization_id:machine_id, so expected will
always be false; fix by reverse-mapping each hostId using the tunnel-owner HSET
(the "tunnel-owner" key) to obtain the corresponding machine_id/organization_id
pair and reconstruct the compound key before comparison (use redis.hget/hmget on
the tunnel-owner mapping for the hostId(s) retrieved into connected and build
`${organization_id}:${machine_id}` to compare with the SQL identifier), or
alternatively change relay's zadd in apps/relay/src/directory.ts to store
`${organizationId}:${hostId}` as the member so the formats match—implement one
of these two fixes and keep RELAY_TTL_KEY/connected usage consistent with the
chosen mapping approach.

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts">

<violation number="1" location="apps/api/src/app/api/hosts/jobs/sync-presence/route.ts:63">
P1: Skipping reconciliation when `connected` is empty breaks the legitimate “all hosts offline” case and can leave stale `is_online=true` rows unhealed.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +63 to +73
if (connected.length === 0) {
console.warn(
"[sync-presence] empty connected set; skipping reconcile to avoid mass-flip",
);
return Response.json({
connected: 0,
flippedOn: 0,
flippedOff: 0,
skipped: true,
});
}
P1: Skipping reconciliation when connected is empty breaks the legitimate “all hosts offline” case and can leave stale is_online=true rows unhealed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/app/api/hosts/jobs/sync-presence/route.ts, line 63:

<comment>Skipping reconciliation when `connected` is empty breaks the legitimate “all hosts offline” case and can leave stale `is_online=true` rows unhealed.</comment>

<file context>
@@ -54,29 +57,55 @@ export async function POST(request: Request): Promise<Response> {
+	// Refuse to mass-flip when the directory comes back empty — most likely a
+	// misconfigured KV credential or a wiped key, not a real zero-host state.
+	// The relay's event-driven setOnline writes still cover genuine disconnects.
+	if (connected.length === 0) {
+		console.warn(
+			"[sync-presence] empty connected set; skipping reconcile to avoid mass-flip",
</file context>
Suggested change
- 	if (connected.length === 0) {
- 		console.warn(
- 			"[sync-presence] empty connected set; skipping reconcile to avoid mass-flip",
- 		);
- 		return Response.json({
- 			connected: 0,
- 			flippedOn: 0,
- 			flippedOff: 0,
- 			skipped: true,
- 		});
- 	}
+ 	// Allow empty directory snapshots so reconciliation can flip all hosts offline when none are connected.
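
The trade-off behind this suggestion can be sketched with a tiny in-memory model. The `Host` type and `reconcile` function are hypothetical and only mimic the guard's effect; the real route performs the update in SQL.

```typescript
// Hypothetical in-memory stand-in for the v2_hosts rows.
type Host = { id: string; isOnline: boolean };

function reconcile(
  hosts: Host[],
  connected: string[],
  skipWhenEmpty: boolean,
): Host[] {
  // Guard as written in the PR: treat an empty snapshot as suspicious and do nothing.
  if (skipWhenEmpty && connected.length === 0) return hosts;
  const live = new Set(connected);
  return hosts.map((h) => ({ ...h, isOnline: live.has(h.id) }));
}

// One stale row, and a directory that is genuinely empty.
const stale: Host[] = [{ id: "host_1", isOnline: true }];

const guarded = reconcile(stale, [], true); // stale row stays online
const unguarded = reconcile(stale, [], false); // stale row is healed
```

With the guard, a genuinely empty directory never heals the stale row. Without it, a wiped key or misconfigured credential would flip every host offline until the next successful pass. Which risk matters more depends on how likely each failure is in this deployment.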

@saddlepaddle saddlepaddle merged commit 9f67fb5 into main May 13, 2026
17 checks passed
saddlepaddle added a commit that referenced this pull request May 13, 2026
Changes since v0.2.15:

- workspaces: `superset workspaces list` table now shows the workspace
  ID column. (#4463)
- docs: VHS demo walkthrough recording added under `packages/cli/demo/`.
  (#4461)
- auth: `superset auth login --api-key sk_live_…` stores an API key at
  `~/.superset/config.json` instead of running the OAuth flow; `whoami`
  and `logout` updated to recognize stored keys. (#4472)
- hosts: `superset hosts list` shows `local` for the host machine you're
  invoking from, distinct from `online`/`no`. (#4476)
- automations: `--agent` is now the host agent instance/preset id (e.g.
  `claude`, `codex`, `superset`) and is resolved live from the host on
  every run. The `--agent-config-file` flag is removed; create/update
  rename `agentConfig` -> `agent` end-to-end. (#4481)

Push cli-v0.2.16 after this lands to fire the release pipeline.