Native Postgres connector: memory feeds, live pushdown, connection-backed derived entities #1182
Open
buremba wants to merge 14 commits into main from feat/postgres-connector
Commits:
74fa97e (buremba) feat(connectors,server): postgres connector + external-backed derived…
372289c (buremba) fix(server): cloud-mode gate on external executor + DML-CTE guard; tests
dbb960b (buremba) refactor(server): extract shared pg-oid util (dedup query_sql + execu…
23559aa (buremba) fix(connectors): reject write-capable CTEs + run probe read-only (pi …
5d08ea5 (buremba) refactor: lean pushdown — connector query() replaces gateway executor
12c793b (buremba) fix(server): connection visibility + execution-time cloud gate on pus…
1e3aa5d (buremba) chore: pushdown limit/offset validation; drop unused virtual feed fla…
0ac9091 (buremba) fix(connectors): postgres parser fallback, pagination/total, cloud-ga…
ac734e0 (buremba) docs(database-connectors): correct virtual-flag + cloud-gate claims (…
2caca7f (buremba) feat(connectors): SSRF egress guard + dogfood E2E + pollWorkerJob gat…
3c0139d (buremba) docs(database-connectors): document the implemented egress guard
c4f8d83 (buremba) fix(connectors): harden egress guard + close review findings
1b52a21 (buremba) fix(connectors): namespace postgres origin_id by feed instance (no cr…
db46e94 (buremba) chore: pin owletto submodule to main baseline
28 changes: 28 additions & 0 deletions
db/migrations/20260601120000_entity_types_backing_source.sql
@@ -0,0 +1,28 @@
-- migrate:up

-- External-backed derived entity types. A derived entity type (backing_sql IS NOT
-- NULL) normally runs its view over Lobu's own org-scoped tables. When
-- backing_source references a connection, the view instead executes LIVE against
-- that connection's single external database (read-only, no copy): the read goes
-- get_type → query_sql({ sql: backing_sql, connection: backing_source }) →
-- runConnectorQuery (connector-pushdown.ts), which runs the SQL in the
-- connection's connector. NULL ⇒ internal (today's behavior). backing_source is
-- only meaningful on a derived type (backing_sql IS NOT NULL).
--
-- Deliberately NO foreign key to connections: if the source connection is deleted
-- the read must FAIL ("source connection no longer exists") rather than silently
-- fall back to internal scoping (ON DELETE SET NULL — which would run external SQL
-- against internal tables) or block connection deletion (ON DELETE RESTRICT).
-- runConnectorQuery validates the connection exists, is in-org, and is visible to
-- the caller at read time.
--
-- Stored as the connection SLUG (text), not an id: the slug is what the config
-- diff compares (no churn), it survives a connection delete+recreate, and
-- runConnectorQuery resolves slug → connection → DATABASE_URL fresh at read time.
--
-- Idempotent: no-op on databases that already have the column.
ALTER TABLE public.entity_types ADD COLUMN IF NOT EXISTS backing_source text;

-- migrate:down

ALTER TABLE public.entity_types DROP COLUMN IF EXISTS backing_source;
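The read-time rule in the migration comments (resolve the slug on every read, and fail rather than fall back when the connection is gone) can be sketched as follows. This is a minimal illustration, not the code in `connector-pushdown.ts`: the `EntityType`, `Connection`, and `ReadPlan` shapes and the `planDerivedRead` helper are hypothetical, only the `backing_sql` / `backing_source` names and the error message come from the migration, and the caller-visibility check is left out.

```typescript
// Hypothetical shapes for illustration; only backing_sql / backing_source
// are names taken from the migration.
interface EntityType {
  slug: string;
  backing_sql: string | null;
  backing_source: string | null; // connection slug, not an id
}

interface Connection {
  slug: string;
  orgId: string;
}

type ReadPlan =
  | { kind: "internal"; sql: string }
  | { kind: "external"; sql: string; connection: Connection };

// Decide where a derived type's view runs. The slug is looked up on every
// call, so a connection that was deleted and recreated resolves again.
function planDerivedRead(
  type: EntityType,
  orgId: string,
  connectionsBySlug: Map<string, Connection>,
): ReadPlan {
  if (type.backing_sql === null) {
    throw new Error(`entity type "${type.slug}" is not a derived type`);
  }
  if (type.backing_source === null) {
    // NULL backing_source means the internal org-scoped path.
    return { kind: "internal", sql: type.backing_sql };
  }
  const connection = connectionsBySlug.get(type.backing_source);
  // Fail loudly: never run external SQL against internal tables.
  if (connection === undefined || connection.orgId !== orgId) {
    throw new Error("source connection no longer exists");
  }
  return { kind: "external", sql: type.backing_sql, connection };
}
```

Because there is no foreign key, a dangling slug is a legal stored state; the sketch turns it into an error at read time instead of a silent change of scope.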
@@ -0,0 +1,94 @@
# Database connectors (Postgres) — design + gating

Bring an external database in as memory, and read it live (no copy) for derived
entities. V1 ships **Postgres**; Snowflake/BigQuery are additive (see end).

## The model: connectors push compute down; Lobu aggregates

The connector owns the DB connection — for *both* indexing and live reads. The
gateway never opens an external pool.

- **Memory feed (indexed)** — a `postgres` connection + a `query` feed runs a
read-only `SELECT` on a schedule, keyset-incremental, and emits one event per
row → embedded, searchable memory. (`packages/connectors/src/postgres.ts`)
- **Live read (no copy)** — the connector's `query()` runs SQL live against the
source and returns rows, persisting nothing. The platform reaches it through one
primitive: `runConnectorQuery` (`packages/server/src/lib/connector-pushdown.ts`),
which invokes the connector in the worker `query` run-mode (the same inline-run
path as `operations.execute`).
- **`query_sql({ connection })`** is the single door: with a `connection` slug it
pushes the SQL down via `runConnectorQuery` (internal org-scoping skipped — it's
the org's own DB); without, it runs the internal org-scoped path. There is no
separate `query_entity_type` tool.
- **Derived entity** — `defineEntityType({ backing: { sql, connection? } })`. With
`connection`, the read is `get_type → query_sql({ sql: backing_sql, connection })`
→ pushdown. Without, it's the shipped internal view over `events`/`entities`.

Single-database only: every query targets one database; no cross-source joins
(that's a later DuckDB-class engine).

Slice 2 (next): **virtual feeds** (a `virtual` feed flag → live reads, no events)
and **federated search** (a connector `search()` the platform fans out to and
merges with the vector index). Only the `query()` live-read primitive is in place
today; the `virtual` feed flag, `search()`, and the fan-out are the remaining work.

## SSRF / egress trust model

The DB socket lives in the **connector subprocess**, behind the worker egress
controls — not the gateway. The dogfood reaches Lobu's own private PG, so the HTTP
scrapers' block-all-private-IPs rule can't be reused.

- **Self-hosted / first-party:** `DATABASE_URL` is an operator-set secret — same
trust boundary as any other env secret. Private IPs allowed. Ships now.
- **Untrusted multi-tenant cloud:** a tenant-supplied `DATABASE_URL` (metadata
IPs, internal CIDRs, another tenant's DB) is an exfil/scan vector. **Not allowed
yet.** Under `LOBU_CLOUD_MODE=1` the postgres connector is hidden from the
catalog (`connector-catalog.ts`) and connection-create is hard-blocked
(`manage_connections.ts` via `connector-cloud-gate.ts`). Execution is gated
independently at every run path, not just by catalog-hide: scheduled-sync run
creation (`queue-helpers.ts`), the production worker poll (`worker-api.ts`), the
dev-CLI sync (`feed-sync.ts`), and the live pushdown (`connector-pushdown.ts`)
each refuse a cloud-restricted connector under `LOBU_CLOUD_MODE`.

**Egress guard (`packages/connectors/src/db-egress-guard.ts`).** The connector
runs a pre-connect host check on both `sync()` and `query()`. Policy comes from
`ctx.config.LOBU_DB_EGRESS_POLICY`, injected by the server from cloud mode:

- `allow-private` (self-hosted, the default) — allows loopback / RFC1918 / CGNAT
/ ULA, but still blocks link-local + cloud metadata (`169.254/16`), multicast,
and the unspecified address (no DB lives there).
- `block-private` (cloud) — blocks **every** non-public address. A hostname is
resolved and rejected if ANY returned address is blocked (multi-record rebind),
with IPv4-mapped / NAT64 / zone-id normalization and fail-closed on malformed
literals.

**Remaining before enabling on cloud** (then remove the key from
`CLOUD_RESTRICTED_CONNECTOR_KEYS`): pin the resolved IP into the socket to close
the DNS-rebind TOCTOU across the pool, force TLS when the URL omits it, and a
per-org allowlist. The classifier + reject is in place and tested; the gate is
what currently keeps untrusted tenants out.

## Entitlement boundary (design-only — not yet built)

Gate advanced database connectivity behind a paid tier. Seam: `organization.plan`
(`free` | `pro` | `enterprise`) + a check in the `multi-tenant.ts` auth resolver.

| Capability | Tier |
| --- | --- |
| Postgres connector + memory feeds | free / pro |
| Internal derived entities | free / pro |
| External-backed (live) derived entities — `backing.connection` set | pro / enterprise |
| Warehouse connectors (Snowflake, BigQuery), virtual feeds + federated search | enterprise |

Enforcement points when built: connector install, connection count, and presence
of `backing.connection`.

## Snowflake / BigQuery forward-compat

No redesign needed: each is a new bundled connector implementing `sync()` +
`query()` (+ later `search()`), with `env_keys` carrying its credentials
(Snowflake account/user/keypair/warehouse/role; BigQuery service-account JSON).
The pushdown plumbing (`runConnectorQuery`, the `query` run-mode, `query_sql`'s
`connection`) is dialect-agnostic — only the connector's own `query()` differs.
Metered warehouses make "live, every read" costly → those lean on the indexed
(memory-feed) path or materialization.
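The two egress policies described in the design doc can be illustrated with a small classifier. This is a sketch under stated assumptions, not the contents of `db-egress-guard.ts`: it handles IPv4 literals only (no ULA, IPv4-mapped, NAT64, or zone-id handling), covers only the ranges the doc names, and the function and type names (`classifyIpv4`, `isEgressAllowed`, `AddressClass`) are hypothetical. Only the policy names `allow-private` and `block-private` come from the doc.

```typescript
type EgressPolicy = "allow-private" | "block-private";
type AddressClass =
  | "public" | "loopback" | "private" | "cgnat"
  | "link-local" | "multicast" | "unspecified";

// Parse a dotted-quad literal; reject leading zeros so "010.0.0.1" cannot be
// read as octal by a downstream resolver (fail closed on anything unusual).
function parseIpv4(literal: string): [number, number, number, number] | null {
  const parts = literal.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^(0|[1-9]\d{0,2})$/.test(p) ? Number(p) : NaN));
  if (octets.some((o) => Number.isNaN(o) || o > 255)) return null;
  return octets as [number, number, number, number];
}

function classifyIpv4(literal: string): AddressClass | null {
  const octets = parseIpv4(literal);
  if (octets === null) return null;
  const [a, b] = octets;
  if (a === 0) return "unspecified";                       // 0.0.0.0/8
  if (a === 127) return "loopback";                        // 127.0.0.0/8
  if (a === 10) return "private";                          // RFC 1918
  if (a === 172 && b >= 16 && b <= 31) return "private";   // RFC 1918
  if (a === 192 && b === 168) return "private";            // RFC 1918
  if (a === 100 && b >= 64 && b <= 127) return "cgnat";    // 100.64.0.0/10
  if (a === 169 && b === 254) return "link-local";         // incl. cloud metadata
  if (a >= 224 && a <= 239) return "multicast";            // 224.0.0.0/4
  return "public";
}

// A host passes only if EVERY resolved address passes, so a multi-record
// answer cannot smuggle one blocked address past the check.
function isEgressAllowed(policy: EgressPolicy, resolved: string[]): boolean {
  if (resolved.length === 0) return false;
  return resolved.every((addr) => {
    const cls = classifyIpv4(addr);
    if (cls === null) return false; // malformed literal: fail closed
    if (cls === "public") return true;
    if (policy === "block-private") return false;
    return cls === "loopback" || cls === "private" || cls === "cgnat";
  });
}
```

As the doc notes, a pre-connect check like this leaves a DNS-rebind window between resolution and connect; pinning the resolved IP into the socket is the listed follow-up.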
Make the signup feed primary key unique per emitted row.
`primary_key: "id"` points to `u.id`, but this query can emit multiple rows per user (one per org). That can cause cursor collisions and skipped/overwritten events across pages. Use a row-unique key for this feed (e.g., `user_id + org`).
Proposed fix:
 config: {
-  primary_key: "id",
+  primary_key: "signup_row_id",
   cursor_column: "created_at",
   // Base SELECT only — the connector adds the keyset WHERE / ORDER BY / LIMIT.
-  query: `SELECT u.id, u.email, u.name, u."createdAt" AS created_at, o.slug AS org
+  query: `SELECT concat(u.id::text, ':', o.slug) AS signup_row_id,
+                 u.id AS user_id,
+                 u.email,
+                 u.name,
+                 u."createdAt" AS created_at,
+                 o.slug AS org
           FROM "user" u
           JOIN member m ON m."userId" = u.id
           JOIN organization o ON o.id = m."organizationId"`,
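The collision the review comment describes can be reproduced with a toy model. This is a sketch that assumes events are upserted on the feed's primary key, which the comment implies but the source does not confirm; `SignupRow` and `indexRows` are hypothetical names and the rows are made-up sample data.

```typescript
interface SignupRow {
  user_id: string;
  org: string;
  created_at: string;
}

// Hypothetical stand-in for an event store that upserts on the primary key.
function indexRows(
  rows: SignupRow[],
  keyOf: (row: SignupRow) => string,
): Map<string, SignupRow> {
  const store = new Map<string, SignupRow>();
  for (const row of rows) store.set(keyOf(row), row); // same key overwrites
  return store;
}

// One user who belongs to two orgs yields two rows from the JOIN.
const rows: SignupRow[] = [
  { user_id: "u1", org: "acme", created_at: "2026-01-01T00:00:00Z" },
  { user_id: "u1", org: "globex", created_at: "2026-01-01T00:00:00Z" },
  { user_id: "u2", org: "acme", created_at: "2026-01-02T00:00:00Z" },
];

const byUserId = indexRows(rows, (r) => r.user_id);
const byCompositeKey = indexRows(rows, (r) => `${r.user_id}:${r.org}`);

console.log(byUserId.size);       // 2: the second u1 row replaced the first
console.log(byCompositeKey.size); // 3: every emitted row is kept
```

The composite key mirrors the `concat(u.id::text, ':', o.slug)` expression in the proposed fix.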