fix(embeddings): bridge HF_ENDPOINT env var to transformers.js env.remoteHost (#1205) #1252
…moteHost (abhigyanpatwari#1205)

`@huggingface/transformers` does not read `HF_ENDPOINT` on its own; it reads `env.remoteHost`. Both gitnexus embedder entry points (the analyze pipeline at `core/embeddings/embedder.ts` and the MCP server at `mcp/core/embedder.ts`) only set `env.allowLocalModels = false` and `env.cacheDir` before calling `pipeline()`, so users behind networks where `huggingface.co` is unreachable (corporate proxies, the GFW, air-gapped mirrors) cannot use `--embeddings` even after exporting the standard `HF_ENDPOINT=https://hf-mirror.com`. Reporter verified the bridging patch works end-to-end (abhigyanpatwari#1205).

Both entry points now call `applyHfEnvOverrides(env)` (new helper at `core/embeddings/hf-env.ts`) which:
- maps `HF_HOME` → `env.cacheDir` (preserved from the existing logic at both call sites; was duplicated, now centralised)
- maps `HF_ENDPOINT` → `env.remoteHost` (abhigyanpatwari#1205), normalising the trailing slash because transformers.js builds URLs by string concatenation and a missing slash silently falls through to its default `huggingface.co/...` host

Helper is a pure function so the bridging logic is unit-testable without mocking transformers.js.

The exported `applyHfEnvOverrides` and `HfEnvSubset` are marked `@internal` per existing codebase convention (see `call-processor.ts:1788`, `pipeline.ts:48`): they exist to share between the two embedder entry points and to be exercised by tests, not as part of the public package API.

Tests at `test/unit/hf-env.test.ts` cover all 5 paths: `cacheDir` default, `cacheDir` from `HF_HOME`, `remoteHost` from `HF_ENDPOINT` without trailing slash, `remoteHost` from `HF_ENDPOINT` with trailing slash preserved, and `remoteHost` left untouched when `HF_ENDPOINT` is unset.

Unused `os` and `join` imports removed from both embedder files now that the path-construction logic lives exclusively in the helper.
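To illustrate why the trailing slash matters, here is a minimal TypeScript sketch of naive host-plus-path concatenation. It is not the transformers.js implementation: `buildUrl`, `normaliseHost`, and the example path are hypothetical names chosen only for illustration.

```typescript
// Hypothetical illustration only: NOT the transformers.js source.
// Shows how plain string concatenation behaves with and without a trailing slash.
function buildUrl(remoteHost: string, relativePath: string): string {
  return remoteHost + relativePath;
}

// Append a trailing slash only when it is missing.
function normaliseHost(host: string): string {
  return host.endsWith("/") ? host : host + "/";
}

const examplePath = "some-org/some-model/resolve/main/config.json"; // made-up path

const broken = buildUrl("https://hf-mirror.com", examplePath);
const fixed = buildUrl(normaliseHost("https://hf-mirror.com"), examplePath);

console.log(broken); // host and path run together
console.log(fixed); // host and path separated by "/"
```

Normalising once at the point where the env var is read keeps every later concatenation safe, which may be why the commit places it in the shared helper rather than at each call site.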
@azizur100389 is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it.
CI Report: ✅ All checks passed
Test Results: ✅ All 7760 tests passed, 1 test skipped
Claude finished @magyargergo's task in 2m 41s
PR #1252 Review
| DoD item | Status |
|---|---|
| `HF_HOME` still maps to `env.cacheDir` exactly as before | ✅ satisfied: line 51 is identical to the pre-existing inline logic |
| Default cache resolves to `join(os.homedir(), '.cache', 'huggingface')` | ✅ satisfied: verified in helper and matching test |
| `HF_ENDPOINT` maps to `env.remoteHost` in analyze embedder | ✅ satisfied: `core/embeddings/embedder.ts:165` |
| `HF_ENDPOINT` maps to `env.remoteHost` in MCP embedder | ✅ satisfied: `mcp/core/embedder.ts:51` |
| Trailing slash appended when missing, preserved when present | ✅ satisfied: `hf-env.ts:53–55`, covered by tests 3 and 4 |
| `remoteHost` untouched when `HF_ENDPOINT` is unset | ✅ satisfied: `if (process.env.HF_ENDPOINT)` guard, covered by test 5 |
| `env.allowLocalModels = false` remains in both entry points | ✅ satisfied: `embedder.ts:160`, `mcp/core/embedder.ts:46` |
| HTTP embedding mode behavior unchanged | ✅ satisfied: both entry points throw early in HTTP mode before reaching `applyHfEnvOverrides` |
| Tests cover helper behavior with clean env isolation | ✅ satisfied: 5 tests, proper `beforeEach`/`afterEach` env save/restore |
| CI, typecheck, and focused tests are green | ✅ satisfied: 7,750/7,750 on 3 platforms |
Embedding env assessment
| Property | Assessment |
|---|---|
| `HF_HOME` | Preserved verbatim. `env.cacheDir = process.env.HF_HOME ?? join(os.homedir(), '.cache', 'huggingface')` is identical to the pre-existing inline logic at both call sites. |
| Default cache dir | Resolves to `~/.cache/huggingface` when `HF_HOME` is unset. Cross-platform via `os.homedir()` + `path.join`. ✅ |
| `HF_ENDPOINT` | Bridged to `env.remoteHost` in both entry points. First-time-in-process behavior is correct. ✅ |
| Trailing slash normalization | `endsWith('/') ? value : value + '/'` is correct for Transformers.js, which builds download URLs by string concatenation (missing slash silently misroutes to default host). ✅ |
| Unset `HF_ENDPOINT` | `if (process.env.HF_ENDPOINT)` guard leaves existing `remoteHost` untouched. Empty string (`''`) is falsy → correctly ignored. ✅ |
| Stale `remoteHost` | Both entry points use `initPromise` / `embedderInstance` singleton guards, so `applyHfEnvOverrides` is called at most once per process. Consistent with how `HF_HOME` was handled before. Not a regression. ✅ |
| Whitespace values | `' '` is truthy → would set `remoteHost = ' /'`. Low-severity: obvious failure, user error, but worth a `.trim()` guard (see Findings). |
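The singleton guard described in the "Stale remoteHost" row can be sketched as follows. This is a simplified, hypothetical stand-in: `FakeEmbedder`, `initEmbedderOnce`, `getEmbedder`, and `initCount` are assumptions for illustration, not the gitnexus source. It only shows why env overrides placed inside the init function run at most once per process.

```typescript
// Hypothetical sketch of a once-per-process init guard; not the gitnexus code.
interface FakeEmbedder {
  ready: boolean;
}

let initPromise: Promise<FakeEmbedder> | null = null;
let initCount = 0; // counts how many times the expensive init actually runs

async function initEmbedderOnce(): Promise<FakeEmbedder> {
  initCount += 1; // env overrides would be applied here, before model loading
  return { ready: true };
}

// Not declared async on purpose: every caller gets the same stored promise.
function getEmbedder(): Promise<FakeEmbedder> {
  if (initPromise === null) {
    initPromise = initEmbedderOnce();
  }
  return initPromise;
}

const first = getEmbedder();
const second = getEmbedder();
console.log(first === second, initCount);
```

A consequence of this pattern is that changing `HF_ENDPOINT` after the first call has no effect until the process restarts, which matches the "first-time-in-process" wording in the table.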
Analyze/MCP parity assessment
| Property | Analyze (`core/embeddings/embedder.ts`) | MCP (`mcp/core/embedder.ts`) | Parity |
|---|---|---|---|
| Imports helper | ✅ line 25 | ✅ line 15 | ✅ |
| `allowLocalModels = false` | ✅ line 160 | ✅ line 46 | ✅ |
| `applyHfEnvOverrides(env)` before `pipeline()` | ✅ line 165 | ✅ line 51 | ✅ |
| `HF_HOME` duplication removed | ✅ no inline `cacheDir` assignment | ✅ no inline `cacheDir` assignment | ✅ |
| HTTP mode unchanged | ✅ throws at line 129 before env setup | ✅ throws at line 30 before env setup | ✅ |
| One-time init / `initPromise` | ✅ singleton pattern intact | ✅ singleton pattern intact | ✅ |
Both paths are symmetrically wired. No divergence.
Test assessment
| Aspect | Assessment |
|---|---|
| Helper tests: 5/5 paths covered | ✅ default cache, custom cache, slash appended, slash preserved, `remoteHost` untouched |
| Env var isolation | ✅ `beforeEach` snapshots both `HF_HOME` and `HF_ENDPOINT`; `afterEach` restores precisely (handles both `undefined` and set cases) |
| Cross-platform path expectation | ✅ test uses `join(os.homedir(), '.cache', 'huggingface')`, the same expression as the helper |
| Call-site wiring confirmed by code review | ✅ Both call sites verified directly; no Transformers.js mock needed |
| Missing: whitespace `HF_ENDPOINT` | |
| Missing: empty string `HF_ENDPOINT` | Acceptable: `''` is falsy, behavior is correct by language semantics |
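The env isolation noted in the table can be sketched without a test framework. `snapshotEnv` is a hypothetical helper, not taken from the PR's test file, and it works on any plain string map; in a real test the map would be `process.env`, and the snapshot and restore calls would live in `beforeEach` and `afterEach`. Clearing the keys after snapshotting is a choice of this sketch so that each case starts from a clean state.

```typescript
// Hypothetical helper, not taken from the PR's test file.
type EnvMap = Record<string, string | undefined>;

// Record the current values of `keys`, delete them, and return a restore function.
function snapshotEnv(env: EnvMap, keys: string[]): () => void {
  const saved = keys.map((key) => [key, env[key]] as const);
  for (const key of keys) {
    delete env[key];
  }
  return () => {
    for (const [key, value] of saved) {
      if (value === undefined) {
        delete env[key]; // was unset before: make sure it is unset again
      } else {
        env[key] = value; // was set before: put the original value back
      }
    }
  };
}

const fakeEnv: EnvMap = { HF_HOME: "/tmp/hf-cache" }; // HF_ENDPOINT starts unset
const restore = snapshotEnv(fakeEnv, ["HF_HOME", "HF_ENDPOINT"]);
fakeEnv.HF_ENDPOINT = "https://hf-mirror.com"; // a test mutates the env
restore();
console.log(fakeEnv);
```

Distinguishing "was unset" from "was set" on restore is what prevents a variable from leaking into later tests as the literal string `"undefined"` or as a stale value.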
Validation assessment
| Check | Result |
|---|---|
| Latest CI head SHA `0335456` | ✅ matches PR head commit |
| Typecheck (`tsc --noEmit`) | ✅ clean |
| Targeted `hf-env.test.ts` (5 tests) | ✅ included in full suite run |
| Full unit suite (7,750 tests, 3 platforms) | ✅ all passed |
| E2E | ✅ passed |
| Coverage delta | 📈 +0.1% functions, expected and proportionate |
| Skipped tests | 1 unrelated Ruby block-parameter skip, pre-existing and not related |
| Pre-existing local `git-utils` failures | Superseded by green CI on all 3 platforms |
Final verdict
production-ready with minor follow-ups
Both embedder entry points are correctly and symmetrically wired to applyHfEnvOverrides. The HF_HOME → cacheDir preservation is exact and tested; the new HF_ENDPOINT → remoteHost bridge is correct, guarded, and tested across all meaningful paths. env.allowLocalModels = false is untouched in both files, HTTP mode is unaffected, and the singleton init pattern means env is read exactly once per process — the same guarantee that existed before. CI is green on all 3 platforms with 7,750 tests passing.
The two low-severity follow-up items (.trim() on HF_ENDPOINT, corresponding whitespace test) are user-error edge cases that do not affect normal mirror usage and do not block merge. A docs mention of HF_ENDPOINT support would improve discoverability for the target audience (restricted-network users) but is also not a blocker given that HF_HOME is similarly undocumented today.
…bhigyanpatwari#1205)

Addresses the [low] whitespace finding from the @claude review on PR abhigyanpatwari#1252.

The original guard used bare truthiness on `process.env.HF_ENDPOINT`, which correctly skips the empty string (falsy in JS) but treated whitespace-only values like `' '` as truthy and produced `env.remoteHost = ' /'`, a silently invalid base host that would misroute every model download instead of failing loudly. The target audience for `HF_ENDPOINT` is precisely the users most likely to hit this: people copy-pasting mirror URLs from shell scripts, docs, or chat threads where leading/trailing whitespace is common.

Adding `.trim()` makes the guard a single, reliable rule ("unset, empty, or pure whitespace ⇒ leave remoteHost alone") and collapses both edge cases into the existing untouched-on-unset behaviour.

Two new tests added:
- `HF_ENDPOINT=' '` leaves a sentinel remoteHost untouched
- `HF_ENDPOINT=' https://hf-mirror.com '` trims to the canonical `'https://hf-mirror.com/'` (slash still appended after trim)

The remaining [low] documentation finding (mention HF_ENDPOINT in README / --help) is deferred: the bot itself flagged it as out-of-scope for this PR ("Not required to unblock this fix"), and HF_HOME is similarly undocumented today; touching that surface is a separate cleanup concern.
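The rule stated in this commit message ("unset, empty, or pure whitespace ⇒ leave remoteHost alone") can be sketched as a standalone function. `resolveRemoteHost` is a hypothetical name used only for illustration; per the commit message, the actual change lives inside `applyHfEnvOverrides`.

```typescript
// Hypothetical standalone version of the guard, for illustration only.
// Returns the normalised host, or undefined when remoteHost should be left alone.
function resolveRemoteHost(raw: string | undefined): string | undefined {
  const endpoint = raw?.trim();
  if (!endpoint) {
    return undefined; // unset, empty, or whitespace-only
  }
  return endpoint.endsWith("/") ? endpoint : endpoint + "/";
}

const cases: Array<string | undefined> = [
  undefined,
  "",
  "   ",
  " https://hf-mirror.com ",
  "https://hf-mirror.com/",
];
for (const input of cases) {
  console.log(JSON.stringify(input), "->", resolveRemoteHost(input));
}
```

Trimming before the truthiness check is what folds the whitespace-only case into the same branch as the unset and empty cases, so a single guard covers all three.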
Addressed the two [low] code-related findings in
[low]
    const endpoint = process.env.HF_ENDPOINT?.trim();
    if (endpoint) {
      env.remoteHost = endpoint.endsWith('/') ? endpoint : endpoint + '/';
    }
[low] No whitespace test. Added two:
[low] Documentation. Deferred — the bot itself flagged this as "Not required to unblock this fix," and Local verification on
CI re-running.
Claude finished @magyargergo's task in 2m 46s
PR #1252 Review
| DoD item | Status |
|---|---|
| `HF_HOME` still maps to `env.cacheDir` exactly as before, same default path | ✅ satisfied: `hf-env.ts:51` is identical to the pre-existing inline logic at both call sites |
| `HF_ENDPOINT` maps to `env.remoteHost` in analyze embedder | ✅ satisfied: `core/embeddings/embedder.ts:165` |
| `HF_ENDPOINT` maps to `env.remoteHost` in MCP embedder | ✅ satisfied: `mcp/core/embedder.ts:51` |
| `HF_ENDPOINT` trimmed; whitespace-only and unset leave `remoteHost` untouched | ✅ satisfied: `hf-env.ts:58–61`, tested by tests 5, 6, 7 |
| Trailing slash appended when missing, preserved when present | ✅ satisfied: `hf-env.ts:60`, tested by tests 3 and 4 |
| `env.allowLocalModels = false` enforced in both entry points | ✅ satisfied: `core/embeddings/embedder.ts:160`, `mcp/core/embedder.ts:46` |
| HTTP embedding mode unaffected | ✅ satisfied: both entry points throw before `applyHfEnvOverrides` when `isHttpMode()` is true; HTTP client uses `GITNEXUS_EMBEDDING_URL`/`GITNEXUS_EMBEDDING_MODEL`, completely separate |
| Tests cover all meaningful paths with clean env isolation | ✅ satisfied: 7 tests, `beforeEach` deletes both vars, `afterEach` restores precisely |
| CI, typecheck, focused tests green | ✅ satisfied: latest CI runs (id: 25223884736, 25223884699) both passed at 2026-05-01T17:01:05Z on head `db5fd4f` |
Embedding behavior assessment
| Property | Assessment |
|---|---|
| `HF_HOME` | Preserved verbatim. `env.cacheDir = process.env.HF_HOME ?? join(os.homedir(), '.cache', 'huggingface')` is identical to the pre-existing inline logic. No trim applied to `HF_HOME`; intentional, preserving previous behavior. ✅ |
| Default cache dir | Resolves to `~/.cache/huggingface` via `os.homedir()` + `path.join`. Cross-platform. Tested directly. ✅ |
| `HF_ENDPOINT` | Bridged to `env.remoteHost` via trimmed local variable. First-time-in-process only, due to singleton guards. ✅ |
| Trailing slash | `endsWith('/') ? endpoint : endpoint + '/'` is correct: transformers.js builds URLs by string concatenation; a missing slash silently misroutes to the default host. ✅ |
| Whitespace handling | `?.trim()` collapses `' '` to `''` (falsy) → guard skips assignment. `' https://hf-mirror.com '` → `'https://hf-mirror.com'` → gets slash appended → `'https://hf-mirror.com/'`. ✅ |
| Unset `HF_ENDPOINT` | `process.env.HF_ENDPOINT?.trim()` returns `undefined` → falsy → `remoteHost` untouched. ✅ |
| Analyze embedder | `allowLocalModels = false` at line 160, `applyHfEnvOverrides(env)` at line 165, before `pipeline()` call. Old inline `env.cacheDir = ...` removed cleanly. `os` import removed (was used only for `os.homedir()` inline; now lives in `hf-env.ts`). ✅ |
| MCP embedder | `allowLocalModels = false` at line 46, `applyHfEnvOverrides(env)` at line 51, before `pipeline()`. `silenceStdout`/`restoreStdout` logic unchanged. ✅ |
| HTTP embedding mode | Both entry points guard with `if (isHttpMode()) throw` before any transformers.js env mutation. HTTP client is driven by `GITNEXUS_EMBEDDING_URL`/`GITNEXUS_EMBEDDING_MODEL`, completely orthogonal to `HF_ENDPOINT`. ✅ |
| Local model policy | `env.allowLocalModels = false` in both entry points is untouched. `HF_ENDPOINT` only changes the remote base URL; it does not enable local model loading. ✅ |
Analyze/MCP parity assessment
| Property | Analyze (`core/embeddings/embedder.ts`) | MCP (`mcp/core/embedder.ts`) | Parity |
|---|---|---|---|
| Imports helper | ✅ line 25 | ✅ line 15 | ✅ |
| `allowLocalModels = false` | ✅ line 160 | ✅ line 46 | ✅ |
| `applyHfEnvOverrides(env)` before `pipeline()` | ✅ line 165 | ✅ line 51 | ✅ |
| `HF_HOME` duplication removed | ✅ no inline `cacheDir` assignment | ✅ no inline `cacheDir` assignment | ✅ |
| HTTP mode guard fires before env setup | ✅ throws at lines 129–134 | ✅ throws at lines 30–32 | ✅ |
| Singleton init pattern intact | ✅ `initPromise`/`embedderInstance` | ✅ `initPromise`/`embedderInstance` | ✅ |
Both paths are symmetrically wired. No divergence.
Test assessment
| Aspect | Assessment |
|---|---|
| 7/7 paths covered | ✅ default cache, custom cache, slash appended, slash preserved, `remoteHost` untouched (unset), `remoteHost` untouched (whitespace), padded value trimmed + slash |
| Env isolation | ✅ `beforeEach` snapshots both vars and deletes them; `afterEach` restores precisely (handles `undefined` vs set) |
| Whitespace tests | ✅ both cases added in `db5fd4f`: whitespace-only sentinel check and padded-URL trim+normalize |
| Cross-platform path | ✅ test uses `join(os.homedir(), '.cache', 'huggingface')`, identical expression to helper |
| Tests would fail on old behavior | ✅ test 6 (whitespace sentinel) and test 7 (padded URL) would fail against the original code before `db5fd4f` |
| Missing combined `HF_HOME` + `HF_ENDPOINT` test | |
Validation assessment
| Check | Result |
|---|---|
| Latest CI head SHA `db5fd4f` | ✅ CI runs (id: 25223884736 / 25223884699) created at 2026-05-01T17:01:05Z, matching the whitespace-fix commit push time |
| Typecheck (`tsc --noEmit`) | ✅ clean per CI pipeline |
| Unit tests (3 platforms) | ✅ all passed |
| E2E | ✅ passed |
| Docker Build & Push | ✅ passed |
| Coverage | 📈 positive delta from prior review CI report, no regression |
| Skipped tests | 1 unrelated Ruby block-parameter skip, pre-existing |
| Manual mirror smoke test | Not performed: helper behavior is deterministic and the original reporter verified end-to-end; a mirror smoke would only add confidence, not correctness evidence |
| PR body validation table | |
Final verdict
production-ready with minor follow-ups
Both embedder entry points are correctly and symmetrically wired to applyHfEnvOverrides. The whitespace issue flagged in the prior review has been fully addressed: hf-env.ts:58 uses .trim(), the guard is correctly placed, and two dedicated tests (whitespace-only and padded-value) accompany the fix. env.allowLocalModels = false is untouched in both files, HTTP mode is unaffected, the singleton init pattern is preserved, and the HF_HOME → cacheDir default path is exact and tested. CI is green on the latest head (db5fd4f) on all three platforms.
The two remaining items — a stale "5/5" reference in the PR body and the absence of a combined HF_HOME + HF_ENDPOINT test — are documentation and completeness niceties that do not affect correctness or production behavior. Neither blocks merge. Updating the PR body to reflect "7/7" is a one-line edit worth doing before merging for historical accuracy, but it should not hold up the fix for users behind restricted networks.
Summary
Reported in #1205 by @VincentZhaoBin:
`@huggingface/transformers` does not read the `HF_ENDPOINT` environment variable on its own; it expects callers to set `env.remoteHost`. Both gitnexus embedder entry points only set `env.allowLocalModels = false` and `env.cacheDir` before calling `pipeline()`, so users behind networks where `huggingface.co` is unreachable (corporate proxies, the GFW, air-gapped mirrors) can't use `--embeddings` even after exporting the standard `HF_ENDPOINT=https://hf-mirror.com`. Reporter verified the bridging patch works end-to-end.
Fix
A new helper `applyHfEnvOverrides(env)` in `gitnexus/src/core/embeddings/hf-env.ts` centralises both env-var bridges:
- `HF_HOME` → `env.cacheDir` (preserved from the existing logic at both call sites; was duplicated, now centralised)
- `HF_ENDPOINT` → `env.remoteHost` (#1205), normalising the trailing slash because transformers.js builds URLs by string concatenation and a missing slash silently falls through to its default `huggingface.co/...` host
Both embedder entry points now call the helper:
Touched call sites:
- `gitnexus/src/core/embeddings/embedder.ts`: analyze pipeline embedder
- `gitnexus/src/mcp/core/embedder.ts`: MCP server embedder
Why a helper instead of inline patches in both files
The reporter proposed inline `if (process.env.HF_ENDPOINT) { ... }` blocks. I extracted to a helper for three reasons consistent with prior maintainer signals:
1. `HF_HOME → cacheDir` logic was already duplicated between both files; adding `HF_ENDPOINT` inline would increase duplication. Extracting matches the pattern of HaleTom's #1031 (refactor(setup): migrate all config I/O to mergeJsoncFile), which centralised `mergeJsoncFile` and removed dead code, on this exact package.
2. `initEmbedder()` would require mocking `@huggingface/transformers` to assert side effects on `env`. The @claude review on PR #1247 (fix(group): contract extractors honour .gitnexusignore via shared IgnoreService (#1185)) specifically flagged a "test claims coverage it doesn't exercise" finding; the testable helper avoids that class of issue.
3. `endsWith('/') ? rs : rs + '/'`.
The exported `applyHfEnvOverrides` and `HfEnvSubset` are marked `@internal` per the existing codebase convention (mirroring `call-processor.ts:1788` and `pipeline.ts:48`): they exist to share between the two embedder entry points and to be exercised by tests, not as part of the public package API.
Tests
`gitnexus/test/unit/hf-env.test.ts` covers all 7 paths (7 tests, all passing):
- `cacheDir` defaults to `~/.cache/huggingface` when `HF_HOME` is unset
- `cacheDir` respects `HF_HOME` when set
- `remoteHost` is set when `HF_ENDPOINT` is set, with trailing slash appended if missing
- `remoteHost` preserves an existing trailing slash on `HF_ENDPOINT`
- `remoteHost` is left untouched when `HF_ENDPOINT` is unset (regression guard against future refactors that always assign)
Verification
- `tsc --noEmit`
- `vitest run test/unit/hf-env.test.ts`
- `npm run test:unit` (full)
The 3 unit-suite failures (2× `git-utils.test.ts` plus 1× `skip-git-cli.test.ts`) are pre-existing environment failures on the current `upstream/main`, verified pre-fix on a clean tree (same 3 failures, unchanged). No CI-relevant regression. The pre-existing `git-utils` ones are documented Windows tmpdir specifics; the `skip-git-cli` one is from #1232 and tracked separately.
Why this is safe
- `initEmbedder()` contract
- `os.homedir()` + `path.join()` preserved verbatim from pre-edit code
Closes #1205.