Skip to content

fix(embeddings): bridge HF_ENDPOINT env var to transformers.js env.remoteHost (#1205)#1252

Merged
magyargergo merged 2 commits into
abhigyanpatwari:mainfrom
azizur100389:fix/hf-endpoint-env-var
May 3, 2026
Merged

fix(embeddings): bridge HF_ENDPOINT env var to transformers.js env.remoteHost (#1205)#1252
magyargergo merged 2 commits into
abhigyanpatwari:mainfrom
azizur100389:fix/hf-endpoint-env-var

Conversation

@azizur100389

@azizur100389 azizur100389 commented May 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Reported in #1205 by @VincentZhaoBin: @huggingface/transformers does not read the HF_ENDPOINT environment variable on its own — it expects callers to set env.remoteHost. Both gitnexus embedder entry points only set env.allowLocalModels = false and env.cacheDir before calling pipeline(), so users behind networks where huggingface.co is unreachable (corporate proxies, the GFW, air-gapped mirrors) can't use --embeddings even after exporting the standard HF_ENDPOINT=https://hf-mirror.com. Reporter verified the bridging patch works end-to-end.

Fix

A new helper applyHfEnvOverrides(env) in gitnexus/src/core/embeddings/hf-env.ts centralises both env-var bridges:

Both embedder entry points now call the helper:

- env.cacheDir = process.env.HF_HOME ?? join(os.homedir(), '.cache', 'huggingface');
+ applyHfEnvOverrides(env);

Touched call sites:

  • gitnexus/src/core/embeddings/embedder.ts — analyze pipeline embedder
  • gitnexus/src/mcp/core/embedder.ts — MCP server embedder

Why a helper instead of inline patches in both files

The reporter proposed inline if (process.env.HF_ENDPOINT) { ... } blocks. I extracted to a helper for three reasons consistent with prior maintainer signals:

  1. Existing HF_HOME → cacheDir logic was already duplicated between both files — adding HF_ENDPOINT inline would increase duplication. Extracting matches HaleTom's refactor(setup): migrate all config I/O to mergeJsoncFile #1031 pattern (which centralised mergeJsoncFile and removed dead code) on this exact package.
  2. Testability — the helper is a pure function and is unit-tested directly. Inline code inside async initEmbedder() would require mocking @huggingface/transformers to assert side effects on env. The @claude review on PR fix(group): contract extractors honour .gitnexusignore via shared IgnoreService (#1185) #1247 specifically flagged a "test claims coverage it doesn't exercise" finding; the testable helper avoids that class of issue.
  3. Reporter's normalisation logic preserved verbatimendsWith('/') ? rs : rs + '/'.

The exported applyHfEnvOverrides and HfEnvSubset are marked @internal per the existing codebase convention (mirroring call-processor.ts:1788 and pipeline.ts:48) — they exist to share between the two embedder entry points and to be exercised by tests, not as part of the public package API.

Tests

gitnexus/test/unit/hf-env.test.ts covers all 7 paths (7 tests, all passing):

  • cacheDir defaults to ~/.cache/huggingface when HF_HOME is unset
  • cacheDir respects HF_HOME when set
  • remoteHost is set when HF_ENDPOINT is set, with trailing slash appended if missing
  • remoteHost preserves an existing trailing slash on HF_ENDPOINT
  • remoteHost is left untouched when HF_ENDPOINT is unset (regression guard against future refactors that always assign)

Verification

Check Result
tsc --noEmit ✅ clean
Targeted vitest run test/unit/hf-env.test.ts ✅ 7/7 pass
npm run test:unit (full) ✅ 4812 passed / 10 skipped

The 3 unit-suite failures (2× git-utils.test.ts plus 1× skip-git-cli.test.ts) are pre-existing environment failures on the current upstream/main — verified pre-fix on a clean tree (same 3 failures, unchanged). No CI-relevant regression. The pre-existing git-utils ones are documented Windows tmpdir specifics; the skip-git-cli one is from #1232 and tracked separately.

Why this is safe

Property Status
Backward compatibility ✅ HF_ENDPOINT block only fires when env var is set; HF_HOME path unchanged from pre-fix code
initEmbedder() contract ✅ unchanged; only internal env-bridging changes
Cross-platform os.homedir() + path.join() preserved verbatim from pre-edit code
No new dependency
No GUARDRAILS non-negotiable touched ✅ embeddings policy (#1055) untouched; we only improve access to remote models

Closes #1205.

…moteHost (abhigyanpatwari#1205)

`@huggingface/transformers` does not read `HF_ENDPOINT` on its own —
it reads `env.remoteHost`. Both gitnexus embedder entry points (the
analyze pipeline at `core/embeddings/embedder.ts` and the MCP server
at `mcp/core/embedder.ts`) only set `env.allowLocalModels = false`
and `env.cacheDir` before calling `pipeline()`, so users behind
networks where `huggingface.co` is unreachable (corporate proxies,
the GFW, air-gapped mirrors) cannot use `--embeddings` even after
exporting the standard `HF_ENDPOINT=https://hf-mirror.com`. Reporter
verified the bridging patch works end-to-end (abhigyanpatwari#1205).

Both entry points now call `applyHfEnvOverrides(env)` (new helper at
`core/embeddings/hf-env.ts`) which:
- maps `HF_HOME` → `env.cacheDir` (preserved from the existing logic
  at both call sites — was duplicated; now centralised)
- maps `HF_ENDPOINT` → `env.remoteHost` (abhigyanpatwari#1205), normalising the
  trailing slash because transformers.js builds URLs by string
  concatenation and a missing slash silently falls through to its
  default `huggingface.co/...` host

Helper is a pure function so the bridging logic is unit-testable
without mocking transformers.js. The exported `applyHfEnvOverrides`
and `HfEnvSubset` are marked `@internal` per existing codebase
convention (see `call-processor.ts:1788`, `pipeline.ts:48`) — they
exist to share between the two embedder entry points and to be
exercised by tests, not as part of the public package API.

Tests at `test/unit/hf-env.test.ts` cover all 5 paths: `cacheDir`
default, `cacheDir` from `HF_HOME`, `remoteHost` from `HF_ENDPOINT`
without trailing slash, `remoteHost` from `HF_ENDPOINT` with
trailing slash preserved, and `remoteHost` left untouched when
`HF_ENDPOINT` is unset. Unused `os` and `join` imports removed from
both embedder files now that the path-construction logic lives
exclusively in the helper.
@vercel

vercel Bot commented May 1, 2026

Copy link
Copy Markdown

@azizur100389 is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
7761 7760 0 1 350s

✅ All 7760 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 77.18% 22933/29713 77.02% 📈 +0.2 🟢 ███████████████░░░░░
Branches 66.07% 14730/22292 65.97% 📈 +0.1 🟢 █████████████░░░░░░░
Functions 82.16% 2257/2747 81.86% 📈 +0.3 🟢 ████████████████░░░░
Lines 80.06% 20697/25850 79.89% 📈 +0.2 🟢 ████████████████░░░░

📋 View full run · Generated by CI

@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 2m 41s —— View job


PR #1252 Review — fix(embeddings): bridge HF_ENDPOINT env var to transformers.js env.remoteHost

Todo list:

  • Post initial comment
  • Read repo context files (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md)
  • Read changed files (hf-env.ts, embedder.ts x2, hf-env.test.ts)
  • Read supporting files (config.ts, http-client.ts, types.ts)
  • Inspect hf-env.ts helper
  • Inspect analyze embedder wiring
  • Inspect MCP embedder wiring
  • Check analyze/MCP parity
  • Inspect embedding policy / GUARDRAILS
  • Inspect test coverage
  • Inspect Transformers.js runtime behavior
  • Inspect stale remoteHost behavior
  • Inspect comments and exports
  • Inspect CI/validation
  • Write final review

Review bar for this PR

  • HF_HOME must still map to env.cacheDir exactly as before, with the same default path
  • HF_ENDPOINT must map to env.remoteHost in both analyze and MCP embedder entry points
  • Trailing slash must be appended when missing, preserved when present
  • When HF_ENDPOINT is unset, remoteHost must not be overwritten
  • env.allowLocalModels = false must remain enforced in both embedder entry points
  • HTTP embedding mode must be completely unaffected
  • Tests must cover all 5 helper paths with clean env isolation
  • CI, typecheck, and targeted tests must be green

Issue and PR context considered

#1205 symptom: @huggingface/transformers does not read HF_ENDPOINT automatically — it expects env.remoteHost to be set by callers. Users behind restricted networks (GFW, corporate proxies, HF mirrors) exported HF_ENDPOINT=https://hf-mirror.com but GitNexus still resolved models from huggingface.co.

Changed files: hf-env.ts (new helper), core/embeddings/embedder.ts (analyze entry point), mcp/core/embedder.ts (MCP entry point), test/unit/hf-env.test.ts (5 new tests).

CI status: ✅ Typecheck clean, 7,750 tests passed on 3 platforms, E2E passed, 1 unrelated Ruby skip.

Core risk areas inspected: HF_HOME preservation, HF_ENDPOINT bridging in both entry points, trailing slash normalization, unset-env guard, allowLocalModels policy, HTTP mode isolation, stale remoteHost, test env leakage.


Understanding of the change

@huggingface/transformers JS port exposes env.remoteHost as its configurable base URL for model downloads, but unlike the Python huggingface_hub client it does not bridge the standard HF_ENDPOINT env var on its own. GitNexus previously set env.cacheDir (from HF_HOME) at both embedder entry points but left env.remoteHost untouched, so mirror configurations were silently ignored.

The PR introduces applyHfEnvOverrides(env) in a new hf-env.ts module that (1) reads HF_HOME → env.cacheDir (consolidated from duplicated inline logic) and (2) reads HF_ENDPOINT → env.remoteHost with trailing-slash normalization. Both embedder entry points now call this helper before pipeline(). The helper is marked @internal, uses a local structural type, and has no new runtime dependencies.


Findings

[low] Whitespace HF_ENDPOINT accepted as valid

  • Category: Input validation / silent misconfiguration
  • Files: gitnexus/src/core/embeddings/hf-env.ts:52
  • Issue: if (process.env.HF_ENDPOINT) uses truthiness, so an empty string ('') is correctly ignored (empty string is falsy in JS). However, a whitespace-only value (' ') is truthy and would produce env.remoteHost = ' /', which would silently break URL construction by setting an invalid base host.
  • Why it matters here: Users on restricted networks are precisely the ones who might copy-paste HF_ENDPOINT values from docs/shell scripts with accidental leading/trailing spaces. The failure mode would be a confusing network error rather than a clear misconfiguration message.
  • Recommended fix: Add .trim() before the truthiness check and before use:
    const endpoint = process.env.HF_ENDPOINT?.trim();
    if (endpoint) {
      env.remoteHost = endpoint.endsWith('/') ? endpoint : endpoint + '/';
    }
    Fix this →
  • Blocks merge: No

[low] No test for whitespace HF_ENDPOINT

  • Category: Test coverage
  • Files: gitnexus/test/unit/hf-env.test.ts
  • Issue: The 5 existing tests cover the main paths cleanly but do not assert behavior for HF_ENDPOINT=' '. If .trim() is added to the helper, a test should accompany it. Without .trim(), a test demonstrating the current behavior (whitespace is accepted) would serve as documentation.
  • Why it matters here: The target audience for HF_ENDPOINT support is users in constrained environments; a trim regression would be hard to diagnose.
  • Recommended fix: Add one test:
    it('remoteHost is left untouched when HF_ENDPOINT is whitespace-only', () => {
      process.env.HF_ENDPOINT = '   ';
      envStub.remoteHost = 'sentinel';
      applyHfEnvOverrides(envStub);
      expect(envStub.remoteHost).toBe('sentinel');
    });
  • Blocks merge: No

[low] Documentation discoverability

  • Category: Docs
  • Files: CLI help / README (not in scope of this PR's changed files)
  • Issue: Users won't know HF_ENDPOINT is now supported unless it's surfaced in --help output or the embeddings configuration docs. The existing HF_HOME support is presumably undocumented too, so this is consistent with current project practice.
  • Recommended fix: In a follow-up, add HF_ENDPOINT and HF_HOME to whatever env-var reference the project maintains (e.g. README.md embeddings section or --help text). Not required to unblock this fix.
  • Blocks merge: No

Definition of Done check

DoD item Status
HF_HOME still maps to env.cacheDir exactly as before ✅ satisfied — line 51 is identical to the pre-existing inline logic
Default cache resolves to join(os.homedir(), '.cache', 'huggingface') ✅ satisfied — verified in helper and matching test
HF_ENDPOINT maps to env.remoteHost in analyze embedder ✅ satisfied — core/embeddings/embedder.ts:165
HF_ENDPOINT maps to env.remoteHost in MCP embedder ✅ satisfied — mcp/core/embedder.ts:51
Trailing slash appended when missing, preserved when present ✅ satisfied — hf-env.ts:53–55, covered by tests 3 and 4
remoteHost untouched when HF_ENDPOINT is unset ✅ satisfied — if (process.env.HF_ENDPOINT) guard, covered by test 5
env.allowLocalModels = false remains in both entry points ✅ satisfied — embedder.ts:160, mcp/core/embedder.ts:46
HTTP embedding mode behavior unchanged ✅ satisfied — both entry points throw early in HTTP mode before reaching applyHfEnvOverrides
Tests cover helper behavior with clean env isolation ✅ satisfied — 5 tests, proper beforeEach/afterEach env save/restore
CI, typecheck, and focused tests are green ✅ satisfied — 7,750/7,750 on 3 platforms

Embedding env assessment

Property Assessment
HF_HOME Preserved verbatim. env.cacheDir = process.env.HF_HOME ?? join(os.homedir(), '.cache', 'huggingface') is identical to the pre-existing inline logic at both call sites.
Default cache dir Resolves to ~/.cache/huggingface when HF_HOME is unset. Cross-platform via os.homedir() + path.join. ✅
HF_ENDPOINT Bridged to env.remoteHost in both entry points. First-time-in-process behavior is correct. ✅
Trailing slash normalization endsWith('/') ? value : value + '/' is correct for Transformers.js, which builds download URLs by string concatenation (missing slash silently misroutes to default host). ✅
Unset HF_ENDPOINT if (process.env.HF_ENDPOINT) guard leaves existing remoteHost untouched. Empty string ('') is falsy → correctly ignored. ✅
Stale remoteHost Both entry points use initPromise / embedderInstance singleton guards, so applyHfEnvOverrides is called at most once per process. Consistent with how HF_HOME was handled before. Not a regression. ✅
Whitespace values ' ' is truthy → would set remoteHost = ' /'. Low-severity: obvious failure, user error, but worth a .trim() guard (see Findings).

Analyze/MCP parity assessment

Property Analyze (core/embeddings/embedder.ts) MCP (mcp/core/embedder.ts) Parity
Imports helper ✅ line 25 ✅ line 15
allowLocalModels = false ✅ line 160 ✅ line 46
applyHfEnvOverrides(env) before pipeline() ✅ line 165 ✅ line 51
HF_HOME duplication removed ✅ no inline cacheDir assignment ✅ no inline cacheDir assignment
HTTP mode unchanged ✅ throws at line 129 before env setup ✅ throws at line 30 before env setup
One-time init / initPromise ✅ singleton pattern intact ✅ singleton pattern intact

Both paths are symmetrically wired. No divergence.


Test assessment

Aspect Assessment
Helper tests: 5/5 paths covered ✅ default cache, custom cache, slash appended, slash preserved, remoteHost untouched
Env var isolation beforeEach snapshots both HF_HOME and HF_ENDPOINT; afterEach restores precisely (handles both undefined and set cases)
Cross-platform path expectation ✅ test uses join(os.homedir(), '.cache', 'huggingface') — same expression as helper
Call-site wiring confirmed by code review ✅ Both call sites verified directly; no Transformers.js mock needed
Missing: whitespace HF_ENDPOINT ⚠️ low-severity gap; see Findings
Missing: empty string HF_ENDPOINT Acceptable — '' is falsy, behavior is correct by language semantics

Validation assessment

Check Result
Latest CI head SHA 0335456 ✅ matches PR head commit
Typecheck (tsc --noEmit) ✅ clean
Targeted hf-env.test.ts (5 tests) ✅ included in full suite run
Full unit suite (7,750 tests, 3 platforms) ✅ all passed
E2E ✅ passed
Coverage delta 📈 +0.1% functions — expected and proportionate
Skipped tests 1 unrelated Ruby block-parameter skip — pre-existing, not related
Pre-existing local git-utils failures Superseded by green CI on all 3 platforms

Final verdict

production-ready with minor follow-ups

Both embedder entry points are correctly and symmetrically wired to applyHfEnvOverrides. The HF_HOME → cacheDir preservation is exact and tested; the new HF_ENDPOINT → remoteHost bridge is correct, guarded, and tested across all meaningful paths. env.allowLocalModels = false is untouched in both files, HTTP mode is unaffected, and the singleton init pattern means env is read exactly once per process — the same guarantee that existed before. CI is green on all 3 platforms with 7,750 tests passing.

The two low-severity follow-up items (.trim() on HF_ENDPOINT, corresponding whitespace test) are user-error edge cases that do not affect normal mirror usage and do not block merge. A docs mention of HF_ENDPOINT support would improve discoverability for the target audience (restricted-network users) but is also not a blocker given that HF_HOME is similarly undocumented today.


· PR branch

…bhigyanpatwari#1205)

Addresses the [low] whitespace finding from the @claude review on
PR abhigyanpatwari#1252. The original guard used bare truthiness on
`process.env.HF_ENDPOINT`, which correctly skips the empty string
(falsy in JS) but treated whitespace-only values like `'   '` as
truthy and produced `env.remoteHost = '   /'` — a silently invalid
base host that would misroute every model download instead of
failing loudly.

The target audience for `HF_ENDPOINT` is precisely the users most
likely to hit this: people copy-pasting mirror URLs from shell
scripts, docs, or chat threads where leading/trailing whitespace is
common. Adding `.trim()` makes the guard a single, reliable rule
("unset, empty, or pure whitespace ⇒ leave remoteHost alone") and
collapses both edge cases into the existing untouched-on-unset
behaviour.

Two new tests added:
  - HF_ENDPOINT='   ' leaves a sentinel remoteHost untouched
  - HF_ENDPOINT='  https://hf-mirror.com  ' trims to the canonical
    'https://hf-mirror.com/' (slash still appended after trim)

The remaining [low] documentation finding (mention HF_ENDPOINT in
README / --help) is deferred — the bot itself flagged it as
out-of-scope for this PR ("Not required to unblock this fix"), and
HF_HOME is similarly undocumented today; touching that surface is a
separate cleanup concern.
@azizur100389

Copy link
Copy Markdown
Contributor Author

Addressed the two [low] code-related findings in db5fd4f6. Thanks @magyargergo for summoning the review — the whitespace edge case is a real footgun for the target audience.

[low] HF_ENDPOINT whitespace. The original if (process.env.HF_ENDPOINT) was correct for empty string (falsy) but truthy on ' ', so a copy-paste from a shell script with stray whitespace would produce env.remoteHost = ' /' and silently misroute every download. Replaced with:

const endpoint = process.env.HF_ENDPOINT?.trim();
if (endpoint) {
  env.remoteHost = endpoint.endsWith('/') ? endpoint : endpoint + '/';
}

.trim() collapses unset / empty / whitespace-only into one consistent "leave remoteHost alone" path, and a non-empty trimmed value gets the same trailing-slash normalisation as before. Inline comment explains the rationale.

[low] No whitespace test. Added two:

  • HF_ENDPOINT=' ' leaves a sentinel remoteHost untouched (proves the trim guard).
  • HF_ENDPOINT=' https://hf-mirror.com ' trims to canonical 'https://hf-mirror.com/' (proves trimming + slash work together).

[low] Documentation. Deferred — the bot itself flagged this as "Not required to unblock this fix," and HF_HOME is similarly undocumented today, so surfacing both env vars is a separate docs-only cleanup. Happy to do that as a follow-up PR if you'd like.

Local verification on db5fd4f6:

  • tsc --noEmit clean
  • Targeted vitest run test/unit/hf-env.test.ts: 7/7 pass (was 5, +2 for the new whitespace cases)
  • Full unit suite expected unchanged (only the helper file + its test changed)

CI re-running.

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 2m 46s —— View job


PR #1252 Review — fix(embeddings): bridge HF_ENDPOINT env var to transformers.js env.remoteHost

Todo list:

  • Post initial comment
  • Read repo context files (DoD.md, GUARDRAILS.md)
  • Read changed files (hf-env.ts, embedder.ts x2, hf-env.test.ts)
  • Read supporting files (config.ts, http-client.ts, types.ts)
  • Inspect hf-env.ts helper (fields, imports, trim logic, trailing slash, @internal)
  • Inspect analyze embedder wiring
  • Inspect MCP embedder wiring
  • Check analyze/MCP parity
  • Inspect embedding policy / GUARDRAILS
  • Inspect test coverage (7 tests including whitespace cases)
  • Inspect comments and exports
  • Inspect CI/validation
  • Write final review

Review bar for this PR

  • HF_HOME still maps to env.cacheDir with the same default path (~/.cache/huggingface) and identical fallback logic.
  • HF_ENDPOINT maps to env.remoteHost in both the analyze pipeline embedder (core/embeddings/embedder.ts) and the MCP server embedder (mcp/core/embedder.ts).
  • HF_ENDPOINT values are trimmed; whitespace-only and unset values leave remoteHost untouched.
  • A missing trailing slash is appended; an existing trailing slash is preserved.
  • env.allowLocalModels = false is enforced in both embedder entry points before pipeline().
  • HTTP embedding mode (isHttpMode()) is unaffected — both entry points throw before reaching applyHfEnvOverrides.
  • Tests cover all meaningful paths (default, override, unset, whitespace-only, padded, trailing-slash) with proper env isolation.
  • CI, typecheck, and focused tests are green on the latest head SHA (db5fd4f).

Issue and PR context considered

#1205 symptom: @huggingface/transformers JS port does not read HF_ENDPOINT automatically — it exposes env.remoteHost for callers to set. GitNexus only set env.cacheDir (from HF_HOME) and env.allowLocalModels = false, so users behind corporate proxies, GFW, or HF mirrors who exported HF_ENDPOINT=https://hf-mirror.com saw no effect — models were still fetched from huggingface.co.

Changed files: hf-env.ts (new helper), core/embeddings/embedder.ts (analyze entry point), mcp/core/embedder.ts (MCP entry point), test/unit/hf-env.test.ts (7 tests).

Current CI status: ✅ Both CI and Docker Build & Push runs passed at 2026-05-01T17:01:05Z — this corresponds to the latest head db5fd4f (the whitespace trim follow-up commit pushed by the author in response to the earlier Claude review). Typecheck, 3-platform unit suite, and E2E all green.

Previous whitespace finding status: ✅ Fully addressed. The prior review flagged HF_ENDPOINT=' ' becoming env.remoteHost = ' /'. The fix (const endpoint = process.env.HF_ENDPOINT?.trim(); if (endpoint) { ... }) is present and verified in hf-env.ts:58–61. Two accompanying tests (whitespace-only, padded-with-content) are in hf-env.test.ts:62–83.


Understanding of the change

@huggingface/transformers (the JS port) reads env.remoteHost as its configurable base URL for model downloads, but unlike the Python huggingface_hub client it does not bridge the standard HF_ENDPOINT env var automatically. GitNexus previously set env.cacheDir and env.allowLocalModels at both embedder call sites but left env.remoteHost at its hardcoded default (https://huggingface.co/), silently ignoring any mirror configured by the user.

The PR introduces applyHfEnvOverrides(env) in a new hf-env.ts module that (1) reads HF_HOME → env.cacheDir — consolidating duplicated inline logic that existed in both embedder files — and (2) reads HF_ENDPOINT → env.remoteHost with .trim() normalization and trailing-slash appending. Both embedder entry points now call this helper before pipeline(). The HfEnvSubset structural type keeps the helper free of a transitive dependency on @huggingface/transformers while maintaining full TypeScript type coverage. Both exports are marked @internal. What must remain unchanged — env.allowLocalModels = false, HTTP mode early-exit behavior, ORT log level suppression, model selection, and singleton init semantics — is verified untouched.


Findings

[low] PR body validation table is stale after the whitespace follow-up

  • Category: Documentation accuracy
  • Files: PR body
  • Issue: The PR body reads "Targeted vitest run test/unit/hf-env.test.ts: ✅ 5/5 pass" and "5 tests, all passing" in the Tests section. hf-env.test.ts now contains 7 tests (original 5 + 2 whitespace cases added in db5fd4f). The PR body was not updated when the author addressed the prior Claude review findings.
  • Why it matters here: The PR description is the canonical public record of what was verified. A reader who skips the follow-up comment will undercount the test coverage.
  • Recommended fix: Update the PR body: 5/57/7, and the "5 tests" reference in the Tests narrative → "7 tests".
  • Blocks merge: No — the tests are present and green in CI; this is a docs-only discrepancy.

[low] No combined HF_HOME + HF_ENDPOINT test

  • Category: Test coverage completeness
  • Files: gitnexus/test/unit/hf-env.test.ts
  • Issue: The 7 existing tests verify each assignment independently but do not have a single test that sets both HF_HOME and HF_ENDPOINT together and asserts both cacheDir and remoteHost are set correctly. The two helper paths (HF_HOME and HF_ENDPOINT) are independent branches in applyHfEnvOverrides, so there is no correctness risk from the omission — but a combined test would lock in the "both can coexist" contract as a regression guard.
  • Why it matters here: Users in restricted-network environments are exactly those who may need both vars simultaneously (HF_HOME for a custom cache on a network-mounted drive, HF_ENDPOINT for a mirror).
  • Recommended fix (non-blocking):
    it('sets both cacheDir and remoteHost when both HF_HOME and HF_ENDPOINT are set', () => {
      process.env.HF_HOME = '/tmp/hf';
      process.env.HF_ENDPOINT = 'https://hf-mirror.com';
      applyHfEnvOverrides(envStub);
      expect(envStub.cacheDir).toBe('/tmp/hf');
      expect(envStub.remoteHost).toBe('https://hf-mirror.com/');
    });
  • Blocks merge: No.

Definition of Done check

DoD item Status
HF_HOME still maps to env.cacheDir exactly as before, same default path ✅ satisfied — hf-env.ts:51 is identical to the pre-existing inline logic at both call sites
HF_ENDPOINT maps to env.remoteHost in analyze embedder ✅ satisfied — core/embeddings/embedder.ts:165
HF_ENDPOINT maps to env.remoteHost in MCP embedder ✅ satisfied — mcp/core/embedder.ts:51
HF_ENDPOINT trimmed; whitespace-only and unset leave remoteHost untouched ✅ satisfied — hf-env.ts:58–61, tested by tests 5, 6, 7
Trailing slash appended when missing, preserved when present ✅ satisfied — hf-env.ts:60, tested by tests 3 and 4
env.allowLocalModels = false enforced in both entry points ✅ satisfied — core/embeddings/embedder.ts:160, mcp/core/embedder.ts:46
HTTP embedding mode unaffected ✅ satisfied — both entry points throw before applyHfEnvOverrides when isHttpMode() is true; HTTP client uses GITNEXUS_EMBEDDING_URL/GITNEXUS_EMBEDDING_MODEL, completely separate
Tests cover all meaningful paths with clean env isolation ✅ satisfied — 7 tests, beforeEach deletes both vars, afterEach restores precisely
CI, typecheck, focused tests green ✅ satisfied — latest CI runs (id: 25223884736, 25223884699) both passed at 2026-05-01T17:01:05Z on head db5fd4f

Embedding behavior assessment

Property Assessment
HF_HOME Preserved verbatim. env.cacheDir = process.env.HF_HOME ?? join(os.homedir(), '.cache', 'huggingface') is identical to the pre-existing inline logic. No trim applied to HF_HOME — intentional, preserving previous behavior. ✅
Default cache dir Resolves to ~/.cache/huggingface via os.homedir() + path.join. Cross-platform. Tested directly. ✅
HF_ENDPOINT Bridged to env.remoteHost via trimmed local variable. First-time-in-process only, due to singleton guards. ✅
Trailing slash endsWith('/') ? endpoint : endpoint + '/' is correct — transformers.js builds URLs by string concatenation; a missing slash silently misroutes to the default host. ✅
Whitespace handling ?.trim() collapses ' ' to '' (falsy) → guard skips assignment. ' https://hf-mirror.com ''https://hf-mirror.com' → gets slash appended → 'https://hf-mirror.com/'. ✅
Unset HF_ENDPOINT process.env.HF_ENDPOINT?.trim() returns undefined → falsy → remoteHost untouched. ✅
Analyze embedder allowLocalModels = false at line 160, applyHfEnvOverrides(env) at line 165, before pipeline() call. Old inline env.cacheDir = ... removed cleanly. os import removed (was used only for os.homedir() inline; now lives in hf-env.ts). ✅
MCP embedder allowLocalModels = false at line 46, applyHfEnvOverrides(env) at line 51, before pipeline(). silenceStdout/restoreStdout logic unchanged. ✅
HTTP embedding mode Both entry points guard with if (isHttpMode()) throw before any transformers.js env mutation. HTTP client is driven by GITNEXUS_EMBEDDING_URL/GITNEXUS_EMBEDDING_MODEL — completely orthogonal to HF_ENDPOINT. ✅
Local model policy env.allowLocalModels = false in both entry points is untouched. HF_ENDPOINT only changes the remote base URL; it does not enable local model loading. ✅

Analyze/MCP parity assessment

Property Analyze (core/embeddings/embedder.ts) MCP (mcp/core/embedder.ts) Parity
Imports helper ✅ line 25 ✅ line 15
allowLocalModels = false ✅ line 160 ✅ line 46
applyHfEnvOverrides(env) before pipeline() ✅ line 165 ✅ line 51
HF_HOME duplication removed ✅ no inline cacheDir assignment ✅ no inline cacheDir assignment
HTTP mode guard fires before env setup ✅ throws at lines 129–134 ✅ throws at lines 30–32
Singleton init pattern intact initPromise/embedderInstance initPromise/embedderInstance

Both paths are symmetrically wired. No divergence.


Test assessment

Aspect Assessment
7/7 paths covered ✅ default cache, custom cache, slash appended, slash preserved, remoteHost untouched (unset), remoteHost untouched (whitespace), padded value trimmed + slash
Env isolation beforeEach snapshots both vars and deletes them; afterEach restores precisely (handles undefined vs set)
Whitespace tests ✅ both cases added in db5fd4f — whitespace-only sentinel check and padded-URL trim+normalize
Cross-platform path ✅ test uses join(os.homedir(), '.cache', 'huggingface') — identical expression to helper
Tests would fail on old behavior ✅ — test 6 (whitespace sentinel) and test 7 (padded URL) would fail against the original code before db5fd4f
Missing combined HF_HOME + HF_ENDPOINT test ⚠️ low-severity gap; see Findings

Validation assessment

Check Result
Latest CI head SHA db5fd4f ✅ CI runs (id: 25223884736 / 25223884699) created at 2026-05-01T17:01:05Z — matches the whitespace-fix commit push time
Typecheck (tsc --noEmit) ✅ clean per CI pipeline
Unit tests (3 platforms) ✅ all passed
E2E ✅ passed
Docker Build & Push ✅ passed
Coverage 📈 positive delta from prior review CI report — no regression
Skipped tests 1 unrelated Ruby block-parameter skip — pre-existing
Manual mirror smoke test Not performed — helper behavior is deterministic and the original reporter verified end-to-end; a mirror smoke would only add confidence, not correctness evidence
PR body validation table ⚠️ still reads "5/5 pass" — stale; CI run itself is authoritative

Final verdict

production-ready with minor follow-ups

Both embedder entry points are correctly and symmetrically wired to applyHfEnvOverrides. The whitespace issue flagged in the prior review has been fully addressed: hf-env.ts:58 uses .trim(), the guard is correctly placed, and two dedicated tests (whitespace-only and padded-value) accompany the fix. env.allowLocalModels = false is untouched in both files, HTTP mode is unaffected, the singleton init pattern is preserved, and the HF_HOME → cacheDir default path is exact and tested. CI is green on the latest head (db5fd4f) on all three platforms.

The two remaining items — a stale "5/5" reference in the PR body and the absence of a combined HF_HOME + HF_ENDPOINT test — are documentation and completeness niceties that do not affect correctness or production behavior. Neither blocks merge. Updating the PR body to reflect "7/7" is a one-line edit worth doing before merging for historical accuracy, but it should not hold up the fix for users behind restricted networks.

--- · PR branch

@magyargergo magyargergo merged commit db22a89 into abhigyanpatwari:main May 3, 2026
33 of 34 checks passed
evander-wang pushed a commit to evander-wang/GitNexus that referenced this pull request May 3, 2026
@magyargergo magyargergo mentioned this pull request May 10, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HF_ENDPOINT env var is not honored, embedding model download fails behind GFW / private mirror

2 participants