refactor(pgvector-embedded): vendor shared control/SQL once, not per-platform (−2.7k lines)#978

Merged
buremba merged 2 commits into main from chore/pgvector-dedup-prebuilt-sql
May 20, 2026

@buremba buremba commented May 20, 2026

What

vector.control + vector--0.8.1.sql are byte-identical across all four platforms (verified by md5) — only the compiled library (vector.so/vector.dylib) is platform-specific. They were vendored 4×, bloating the repo by ~2,750 lines of duplicated install SQL.

Move them to the prebuilt root (vendored once); keep only the binary per prebuilt/<platform>/.
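
For reviewers who want to repeat the md5 comparison, a minimal sketch follows. It assumes the caller has already read each platform's copy into memory; `md5Hex` and `isByteIdentical` are hypothetical names for illustration, not functions from this repo.

```typescript
import { createHash } from "node:crypto";

// One platform's copy of a vendored file (e.g. vector.control).
interface PlatformCopy {
  platform: string; // e.g. "linux-x64"
  bytes: Uint8Array; // raw file contents
}

// Hypothetical helper: hex MD5 digest of a file's bytes.
function md5Hex(bytes: Uint8Array): string {
  return createHash("md5").update(bytes).digest("hex");
}

// Returns true when every copy hashes to the same digest.
// An empty or single-element list is trivially identical.
function isByteIdentical(copies: PlatformCopy[]): boolean {
  const digests = copies.map((copy) => md5Hex(copy.bytes));
  return digests.every((digest) => digest === digests[0]);
}
```

The bytes could come from `readFileSync` on each `prebuilt/<platform>/` copy; a `false` result would mean the file is not safe to vendor once.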

Changes

  • injector (src/index.ts): library from prebuilt/<platform>/, control/SQL from the shared root.
  • build.sh: stage library → platform dir, control/SQL → root.
  • build-pgvector-embedded.yml: artifact carries the shared files; open-pr stages binary per-platform + control/SQL once.
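
The layout the three changes above agree on can be sketched as a single path-resolution function. This is an illustration under assumptions: `resolveArtifactPaths` and `ArtifactPaths` are hypothetical names, and the real injector in src/index.ts may be organized differently.

```typescript
import { join } from "node:path";

// Where each artifact lives under the new layout.
interface ArtifactPaths {
  libraryDir: string; // platform-specific: prebuilt/<platform>/
  controlFile: string; // shared: prebuilt/vector.control
  sqlFile: string; // shared: prebuilt/vector--<version>.sql
}

// Hypothetical helper: only the library directory depends on the platform key.
function resolveArtifactPaths(
  prebuiltRoot: string,
  platform: string,
  version: string,
): ArtifactPaths {
  return {
    libraryDir: join(prebuiltRoot, platform),
    controlFile: join(prebuiltRoot, "vector.control"),
    sqlFile: join(prebuiltRoot, `vector--${version}.sql`),
  };
}
```

Calling it for two different platform keys should return the same `controlFile` and `sqlFile`, which is the property the dedup relies on.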

Validation

  • build.sh reproduces the committed layout (darwin-arm64 binary byte-identical).
  • E2E: inject into a fresh embedded PG → CREATE EXTENSION vector → distance query — all pass ('[1,2,3]'::vector <-> '[1,2,4]'::vector = 1).
  • Typecheck passes (0 errors).
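
As a sanity check on the expected value in the E2E query: pgvector's `<->` operator is Euclidean (L2) distance, so the result for these two vectors can be reproduced by hand. The function name below is illustrative only.

```typescript
// Euclidean (L2) distance between two equal-length vectors.
function l2Distance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let sumOfSquares = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sumOfSquares += diff * diff;
  }
  return Math.sqrt(sumOfSquares);
}

// Only the third component differs (3 vs 4), so the distance is sqrt(1).
console.log(l2Distance([1, 2, 3], [1, 2, 4])); // prints 1
```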

Net: 71 insertions, 2,794 deletions. Follows up #965.

Summary by CodeRabbit

  • Refactor

    • Reorganized prebuilt layout so platform-specific native libraries are stored per-platform and extension metadata/scripts are shared once at the package root for simpler distribution.
  • Chores

    • Updated build workflow, staging, and packaging scripts to assemble artifacts into separate platform and shared locations and to copy/install them accordingly.

…platform

vector.control + vector--<version>.sql are byte-identical across all four
platforms (verified) — only the compiled library differs. They were vendored
4×, bloating the tree by ~2,750 lines of duplicated install SQL. Move them to
the prebuilt root (vendored once); keep only the platform binary per dir.

- injector (src/index.ts): read control/SQL from the shared root, library from
  prebuilt/<platform>/.
- build.sh: stage library → prebuilt/<platform>/, control/SQL → prebuilt/ root.
- build-pgvector-embedded.yml: artifact carries the shared files; open-pr stages
  binary per-platform + the shared control/SQL once.

Validated: build.sh reproduces the layout; E2E inject into embedded PG +
CREATE EXTENSION vector + a distance query all pass. Net -2,723 lines.
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

@coderabbitai

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 566dbd80-4455-4c5b-a6b5-d46830ae41f8

📥 Commits

Reviewing files that changed from the base of the PR and between 4aea9ea and 36a62c4.

📒 Files selected for processing (1)
  • packages/pgvector-embedded/src/index.ts

📝 Walkthrough

This PR reorganizes pgvector-embedded artifact storage to avoid duplication: platform-specific native binaries remain in prebuilt/<platform>/, while extension control metadata and SQL definitions move to a shared prebuilt/ root. Build, workflow, and injection logic are updated to reflect this layout.

Changes

pgvector-embedded artifact separation

Layer / File(s) / Summary

  • Build script separation of platform and shared artifacts (packages/pgvector-embedded/scripts/build.sh): Build script introduces SHARED_DIR for vector.control and vector--<version>.sql, while platform .so/.dylib files stage in per-platform OUT_DIR. Directory setup, library copying, and control/SQL vendoring are now separate steps with updated completion logging.
  • Workflow artifact upload for platform and shared directories (.github/workflows/build-pgvector-embedded.yml): Artifact upload now includes both platform-specific directories and shared control/SQL files. Regeneration stage clears per-platform directories, copies only platform subdirectory contents, then overwrites shared control/SQL at the prebuilt root.
  • Injector refactoring for dual-source artifact copying (packages/pgvector-embedded/src/index.ts): New prebuiltDir() helper and hasPrebuiltLibrary() check detect platform-specific binaries. Extended hasPrebuilt() requires both a usable platform library and prebuilt/vector.control. Refactored injectPgvector() splits copying: native libraries from platform directory to native/lib/postgresql, control/SQL from shared root to native/share/postgresql/extension. Module documentation updated.
  • Removal of duplicate control files from platform directories (packages/pgvector-embedded/prebuilt/linux-arm64/vector.control): Platform-specific vector.control files removed as they are now vendored at the shared prebuilt/ root.
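
The dual-source copy described for injectPgvector() can be illustrated with a small planner that only computes source and destination paths. This is a sketch under assumptions: `planInjection` and `CopyStep` are hypothetical names, `nativeDir` stands for the embedded Postgres native directory, and the real function may filter or copy differently.

```typescript
import { join } from "node:path";

interface CopyStep {
  from: string;
  to: string;
}

// Hypothetical planner: lists the copies an injector would perform.
// `libraryFiles` are names found in prebuilt/<platform>/,
// `sharedFiles` are names found at the prebuilt root.
function planInjection(
  prebuiltRoot: string,
  platform: string,
  nativeDir: string,
  libraryFiles: string[],
  sharedFiles: string[],
): CopyStep[] {
  const libDest = join(nativeDir, "lib", "postgresql");
  const extDest = join(nativeDir, "share", "postgresql", "extension");

  // Platform-specific: only compiled libraries come from the platform dir.
  const libs = libraryFiles
    .filter((f) => f.endsWith(".so") || f.endsWith(".dylib"))
    .map((f) => ({ from: join(prebuiltRoot, platform, f), to: join(libDest, f) }));

  // Shared: control file and install SQL come from the prebuilt root.
  const shared = sharedFiles
    .filter((f) => f === "vector.control" || (f.startsWith("vector--") && f.endsWith(".sql")))
    .map((f) => ({ from: join(prebuiltRoot, f), to: join(extDest, f) }));

  return [...libs, ...shared];
}
```

Each returned step could then be applied with `copyFileSync(step.from, step.to)` after creating the destination directories.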

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • lobu-ai/lobu#965: The main PR's changes to packages/pgvector-embedded artifact staging/injection (src/index.ts copy logic and prebuilt vector.control/vector--*.sql layout) directly support the retrieved PR's embedded Postgres boot flow, which injects pgvector native assets at runtime.

Poem

🐰 The vectors once lived in scattered nests,
Platform by platform, without rest.
Now shared roots hold their common core,
While platform libs guard the native door.
One layout true, no duplication's chore! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main refactoring objective: consolidating identical platform-agnostic files into a shared location and removing duplicate vendoring across platforms.
  • Description check: ✅ Passed. The description includes a clear summary explaining the duplication issue, detailed breakdown of changes across multiple files, and comprehensive validation results demonstrating functionality. All key information is present.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


The previous commit's biome --write reformatting wasn't re-staged before
the commit, so CI format:check flagged it. Commit the formatted version.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/pgvector-embedded/scripts/build.sh`:
- Around line 51-76: The shared prebuilt dir (SHARED_DIR) can retain old
vector.control and vector--*.sql files across version bumps; before copying the
new vector.control and vector--${PGVECTOR_SQL_VERSION}.sql into SHARED_DIR,
remove any existing vector.control and vector--*.sql in SHARED_DIR (e.g., purge
files matching "vector.control" and "vector--*.sql") so only the pinned
SQL/control are staged and no stale artifacts remain.

In `@packages/pgvector-embedded/src/index.ts`:
- Around lines 49-54: The hasPrebuilt(platform: string = currentPlatformKey())
function currently returns true if the platform library and the shared control
file exist but doesn't verify the presence of the shared install SQL artifact,
so a missing file only surfaces as a runtime failure; update hasPrebuilt to also
check for a vector--<version>.sql file under PREBUILT_ROOT (e.g., search for
files matching "vector--*.sql") before returning true, using the existing
PREBUILT_ROOT constant and helper utilities (or fs.readdirSync/existsSync) so
that hasPrebuiltLibrary(platform) && existsSync(join(PREBUILT_ROOT,
"vector.control")) && sqlFileExists returns the final boolean.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: eaa93dec-2ce1-49b5-97c5-49414e4abd48

📥 Commits

Reviewing files that changed from the base of the PR and between 7793c56 and 4aea9ea.

📒 Files selected for processing (11)
  • .github/workflows/build-pgvector-embedded.yml
  • packages/pgvector-embedded/prebuilt/darwin-x64/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/darwin-x64/vector.control
  • packages/pgvector-embedded/prebuilt/linux-arm64/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/linux-arm64/vector.control
  • packages/pgvector-embedded/prebuilt/linux-x64/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/linux-x64/vector.control
  • packages/pgvector-embedded/prebuilt/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/vector.control
  • packages/pgvector-embedded/scripts/build.sh
  • packages/pgvector-embedded/src/index.ts
💤 Files with no reviewable changes (6)
  • packages/pgvector-embedded/prebuilt/linux-x64/vector.control
  • packages/pgvector-embedded/prebuilt/darwin-x64/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/darwin-x64/vector.control
  • packages/pgvector-embedded/prebuilt/linux-x64/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/linux-arm64/vector--0.8.1.sql
  • packages/pgvector-embedded/prebuilt/linux-arm64/vector.control

Comment on lines +51 to +76
+SHARED_DIR="${PKG_ROOT}/prebuilt"

 rm -rf "${OUT_DIR}"
-mkdir -p "${OUT_DIR}"
+mkdir -p "${OUT_DIR}" "${SHARED_DIR}"

 # Stage straight from the build dir — we run `make` but not `make install`, so
 # nothing lands in the OS Postgres's pkglibdir/sharedir. The library, control
 # file, and generated SQL all sit under the cloned/built pgvector tree.
 #
-# Compiled extension library — name differs per platform ($(DLSUFFIX)).
+# Compiled extension library is the only PLATFORM-SPECIFIC artifact → goes in
+# prebuilt/<platform>/ (name differs per platform via $(DLSUFFIX)).
 cp "${WORK}/pgvector/vector"*.so "${OUT_DIR}/" 2>/dev/null || true
 cp "${WORK}/pgvector/vector"*.dylib "${OUT_DIR}/" 2>/dev/null || true
 # Fall back to an installed copy only if the build dir somehow lacks the lib.
 if ! ls "${OUT_DIR}"/vector.* >/dev/null 2>&1; then
   cp "${PKGLIBDIR}/vector".* "${OUT_DIR}/"
 fi

-# Control + the full-install SQL for the pinned version only. CREATE EXTENSION
-# at default_version reads vector--<version>.sql directly; the vector--A--B.sql
-# upgrade scripts are only for ALTER EXTENSION ... UPDATE, which never runs on a
-# fresh embedded DB, so we don't ship them. `vector.control` is checked into the
-# repo; `sql/vector--<version>.sql` is generated by `make`.
+# Control + the full-install SQL for the pinned version are byte-identical
+# across platforms, so they're vendored ONCE at the prebuilt root (not per
+# platform). CREATE EXTENSION at default_version reads vector--<version>.sql
+# directly; the vector--A--B.sql upgrade scripts only run on ALTER EXTENSION
+# UPDATE, never on a fresh embedded DB, so we don't ship them.
 PGVECTOR_SQL_VERSION="${PGVECTOR_VERSION#v}"
-cp "${WORK}/pgvector/vector.control" "${OUT_DIR}/"
-cp "${WORK}/pgvector/sql/vector--${PGVECTOR_SQL_VERSION}.sql" "${OUT_DIR}/"
+cp "${WORK}/pgvector/vector.control" "${SHARED_DIR}/"
+cp "${WORK}/pgvector/sql/vector--${PGVECTOR_SQL_VERSION}.sql" "${SHARED_DIR}/"

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear previously vendored shared SQL/control files before staging the pinned version.

The script rewrites platform output but not the shared root. On a version bump, old vector--*.sql can persist and get re-uploaded via the wildcard path, so artifacts become non-deterministic.

Suggested patch
 SHARED_DIR="${PKG_ROOT}/prebuilt"

 rm -rf "${OUT_DIR}"
 mkdir -p "${OUT_DIR}" "${SHARED_DIR}"
+rm -f "${SHARED_DIR}/vector.control"
+rm -f "${SHARED_DIR}"/vector--*.sql

 # Stage straight from the build dir — we run `make` but not `make install`, so
 # nothing lands in the OS Postgres's pkglibdir/sharedir. The library, control
@@
 PGVECTOR_SQL_VERSION="${PGVECTOR_VERSION#v}"
 cp "${WORK}/pgvector/vector.control" "${SHARED_DIR}/"
 cp "${WORK}/pgvector/sql/vector--${PGVECTOR_SQL_VERSION}.sql" "${SHARED_DIR}/"
Suggested change
SHARED_DIR="${PKG_ROOT}/prebuilt"
rm -rf "${OUT_DIR}"
mkdir -p "${OUT_DIR}" "${SHARED_DIR}"
rm -f "${SHARED_DIR}/vector.control"
rm -f "${SHARED_DIR}"/vector--*.sql
# Stage straight from the build dir — we run `make` but not `make install`, so
# nothing lands in the OS Postgres's pkglibdir/sharedir. The library, control
# file, and generated SQL all sit under the cloned/built pgvector tree.
#
# Compiled extension library is the only PLATFORM-SPECIFIC artifact → goes in
# prebuilt/<platform>/ (name differs per platform via $(DLSUFFIX)).
cp "${WORK}/pgvector/vector"*.so "${OUT_DIR}/" 2>/dev/null || true
cp "${WORK}/pgvector/vector"*.dylib "${OUT_DIR}/" 2>/dev/null || true
# Fall back to an installed copy only if the build dir somehow lacks the lib.
if ! ls "${OUT_DIR}"/vector.* >/dev/null 2>&1; then
  cp "${PKGLIBDIR}/vector".* "${OUT_DIR}/"
fi
# Control + the full-install SQL for the pinned version are byte-identical
# across platforms, so they're vendored ONCE at the prebuilt root (not per
# platform). CREATE EXTENSION at default_version reads vector--<version>.sql
# directly; the vector--A--B.sql upgrade scripts only run on ALTER EXTENSION
# UPDATE, never on a fresh embedded DB, so we don't ship them.
PGVECTOR_SQL_VERSION="${PGVECTOR_VERSION#v}"
cp "${WORK}/pgvector/vector.control" "${SHARED_DIR}/"
cp "${WORK}/pgvector/sql/vector--${PGVECTOR_SQL_VERSION}.sql" "${SHARED_DIR}/"
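
To make the stale-artifact concern concrete, here is a minimal sketch of a check that flags leftover install SQL in the shared root after a version bump. Both function names are hypothetical; `sqlVersionFromTag` mirrors what `${PGVECTOR_VERSION#v}` does in build.sh (drop one leading `v`).

```typescript
// Mirrors the shell expansion ${PGVECTOR_VERSION#v}: strips one leading "v".
function sqlVersionFromTag(tag: string): string {
  return tag.replace(/^v/, "");
}

// Returns names that look like install SQL but are not the pinned version.
// `fileNames` is a plain listing of the shared prebuilt root.
function staleInstallSql(fileNames: string[], pinnedTag: string): string[] {
  const pinned = `vector--${sqlVersionFromTag(pinnedTag)}.sql`;
  return fileNames.filter(
    (name) => name.startsWith("vector--") && name.endsWith(".sql") && name !== pinned,
  );
}
```

A non-empty result is the situation the suggested `rm -f` lines are meant to prevent: more than one `vector--*.sql` matching the upload wildcard.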

Comment on lines 49 to +54
 export function hasPrebuilt(platform: string = currentPlatformKey()): boolean {
-  return existsSync(join(prebuiltDir(platform), "vector.control"));
+  // Platform-specific library + the shared control file (vendored once at root).
+  return (
+    hasPrebuiltLibrary(platform) &&
+    existsSync(join(PREBUILT_ROOT, "vector.control"))
+  );

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require shared SQL presence in hasPrebuilt() before reporting artifacts as usable.

Current check validates library + vector.control only. If vector--*.sql is missing, injection still proceeds and the failure is deferred to extension creation at runtime.

Suggested patch
 export function hasPrebuilt(platform: string = currentPlatformKey()): boolean {
   // Platform-specific library + the shared control file (vendored once at root).
+  const hasSharedSql =
+    existsSync(PREBUILT_ROOT) &&
+    readdirSync(PREBUILT_ROOT).some(
+      (f) => f.startsWith("vector--") && f.endsWith(".sql")
+    );
   return (
     hasPrebuiltLibrary(platform) &&
-    existsSync(join(PREBUILT_ROOT, "vector.control"))
+    existsSync(join(PREBUILT_ROOT, "vector.control")) &&
+    hasSharedSql
   );
 }
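
A stricter variant than the wildcard match above would require the exact install script that CREATE EXTENSION reads at default_version. The sketch below is not part of the suggested patch; the function names are hypothetical, and it assumes the control file writes the version as a single-quoted value (for example `default_version = '0.8.1'`).

```typescript
// Extracts default_version from the text of an extension control file.
// Assumes a single-quoted value; returns undefined when the key is absent.
function parseDefaultVersion(controlText: string): string | undefined {
  const match = /^\s*default_version\s*=\s*'([^']+)'/m.exec(controlText);
  return match ? match[1] : undefined;
}

// True only when the install script for default_version is among the shared files.
function hasMatchingInstallSql(controlText: string, sharedFiles: string[]): boolean {
  const version = parseDefaultVersion(controlText);
  if (version === undefined) return false;
  const expected = `vector--${version}.sql`;
  return sharedFiles.some((name) => name === expected);
}
```

With this check a stale `vector--<old>.sql` next to a newer control file would report the prebuilt as unusable instead of deferring the failure to extension creation.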

buremba commented May 20, 2026

bug_free 55, simplicity 72, slop 8, bugs 1, blockers 1

Read diff/logs. Typecheck passed. [env] unit suite hit stale connector-sdk dist from connector-deps. Integration failed in embedded PG setup. Exploratory: booted bundled server with DATABASE_URL=file:///tmp/lobu-review-runtime and /health returned 200; embeddings child warned Cannot find module './cjs/index.cjs'.

Blockers

  • integration suite failed: embedded Postgres test backend aborts during initdb with missing @embedded-postgres native lib symlink (libicudata.77.dylib)

Suggested fixes

  • packages/server/src/__tests__/setup/embedded-postgres-backend.ts, line 46: Fix the embedded-postgres test bootstrap so a fresh review/test run can initdb reliably; add a preflight/repair for required native library symlinks or avoid the broken native package resolution path before calling pg.initialise().
Full verdict JSON
{
  "bug_free_confidence": 55,
  "bugs": 1,
  "slop": 8,
  "simplicity": 72,
  "blockers": [
    "integration suite failed: embedded Postgres test backend aborts during initdb with missing @embedded-postgres native lib symlink (libicudata.77.dylib)"
  ],
  "change_type": "feat",
  "behavior_change_risk": "high",
  "tests_adequate": false,
  "suggested_fixes": [
    {
      "file": "packages/server/src/__tests__/setup/embedded-postgres-backend.ts",
      "line": 46,
      "change": "Fix the embedded-postgres test bootstrap so a fresh review/test run can initdb reliably; add a preflight/repair for required native library symlinks or avoid the broken native package resolution path before calling pg.initialise()."
    }
  ],
  "notes": "Read diff/logs. Typecheck passed. [env] unit suite hit stale connector-sdk dist from connector-deps. Integration failed in embedded PG setup. Exploratory: booted bundled server with DATABASE_URL=file:///tmp/lobu-review-runtime and /health returned 200; embeddings child warned Cannot find module './cjs/index.cjs'.",
  "categories": {
    "src": 2950,
    "tests": 1516,
    "docs": 90,
    "config": 302,
    "deps": 116,
    "migrations": 100,
    "ci": 140,
    "generated": 922
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.
