fix: actionable HF_ENDPOINT guidance, retries, timeout and circuit breaker when embedding model download fails #1419

Merged
magyargergo merged 14 commits into main from copilot/fix-mac-load-embedding-failed
May 8, 2026

Copilot AI commented May 7, 2026

Contributor

gitnexus analyze --embeddings crashed with an unexplained TypeError: fetch failed on networks where huggingface.co is unreachable (GFW, corporate proxies). The codebase already bridged HF_ENDPOINT → env.remoteHost for mirror support, but no error path surfaced that option, and transient network failures were not retried.

Changes

  • hf-env.ts — exports several new helpers and constants:

    • isNetworkFetchError(msg) — detects network-level fetch failures (fetch failed, ECONNREFUSED, ENOTFOUND, ETIMEDOUT, ECONNRESET)
    • HfDownloadCircuitBreaker — closed/open/half-open state machine; opens after 3 consecutive network failures, resets after 60 s (both configurable). A module-level singleton hfDownloadCircuit is shared by both embedder entry points.
    • withDownloadTimeout(fn, ms) — wraps a download in a hard time-limit (5 minutes default); timeout throws ETIMEDOUT so isNetworkFetchError classifies it correctly
    • withHfDownloadRetry(fn, opts?) — wraps pipeline() with per-attempt timeout, exponential-backoff retry on network errors (3 attempts, 2 s → 4 s delay), circuit-breaker recording, and an optional onRetry callback for logging. The per-attempt timeout and max attempts are overridable via HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS env vars; both are validated and clamped (timeoutMs ≤ 30 min, maxAttempts floored and ≤ 10) to prevent accidental runaway configuration.
    • isHfDownloadFailure(msg) — combines isNetworkFetchError and isHfCircuitOpenError for a single guard covering both raw network errors and circuit-open rejections
    • HF_MAX_TIMEOUT_MS / HF_MAX_ATTEMPTS_CAP — exported upper-bound constants used for env-override clamping
  • core/embeddings/embedder.ts + mcp/core/embedder.ts: the pipeline() call is now wrapped in withHfDownloadRetry(); the device-fallback catch block uses isHfDownloadFailure and rethrows immediately with an actionable message when the failure is network-level (device fallback is meaningless for network errors). The message includes both Unix and Windows-compatible command syntax:

    Failed to download embedding model: fetch failed
      huggingface.co may be unreachable from your network.
      Set HF_ENDPOINT to a mirror and retry:
        HF_ENDPOINT=https://hf-mirror.com npx gitnexus analyze --embeddings
        (Windows: set HF_ENDPOINT=https://hf-mirror.com && npx gitnexus analyze --embeddings)
    
  • cli/analyze.ts — the HF download failure branch is now checked before writeFatalToStderr, following the same early-return pattern as RegistryNameCollisionError. This eliminates duplicate output (previously a raw stack trace with an embedded hint was printed, followed by a second cliError() remediation block). The cliError() message also includes the Windows-compatible command syntax.

  • test/unit/hf-env.test.ts — 49 tests total (34 new) covering: circuit breaker state machine transitions (including half-open → failure → reopen and half-open → success → closed paths), withDownloadTimeout (using vi.useFakeTimers() for full determinism), retry exhaustion, circuit opening mid-run, onRetry callback, non-network error passthrough, all original error pattern / negative cases, and a full withHfDownloadRetry env overrides suite validating HF_MAX_ATTEMPTS (valid values, invalid/zero/negative fallback to default, clamping to 10, floor of fractional values), HF_DOWNLOAD_TIMEOUT_MS (valid, invalid fallback, clamping to 30 min), and explicit options taking precedence over env vars; all time-sensitive assertions use vi.useFakeTimers()
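
The env-override rules described above (validate, floor, clamp, and let explicit options win) can be sketched as follows. This is a minimal illustration, not the code in hf-env.ts: `resolveRetryLimits` and `parsePositive` are hypothetical names, the default and cap values are taken from the PR description, and the handling of a value that floors below 1 (falling back to the default) and of explicit options (not clamped) are assumptions.

```typescript
// Caps and defaults as stated in the PR description.
const HF_MAX_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes
const HF_MAX_ATTEMPTS_CAP = 10;
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes
const DEFAULT_MAX_ATTEMPTS = 3;

type EnvVars = Record<string, string | undefined>;

interface RetryLimits {
  timeoutMs: number;
  maxAttempts: number;
}

// Hypothetical helper: returns a positive finite number, or undefined for
// missing, empty, non-numeric, zero, negative, or infinite input.
function parsePositive(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}

// Hypothetical resolver: explicit options win over env vars, env vars win over defaults.
function resolveRetryLimits(env: EnvVars, explicit: Partial<RetryLimits> = {}): RetryLimits {
  const envTimeout = parsePositive(env["HF_DOWNLOAD_TIMEOUT_MS"]);
  const envAttempts = parsePositive(env["HF_MAX_ATTEMPTS"]);

  // Floor fractional attempt counts; assumption: anything below 1 uses the default.
  const floored = envAttempts === undefined ? undefined : Math.floor(envAttempts);
  const attempts = floored !== undefined && floored >= 1 ? floored : DEFAULT_MAX_ATTEMPTS;

  return {
    timeoutMs: explicit.timeoutMs ?? Math.min(envTimeout ?? DEFAULT_TIMEOUT_MS, HF_MAX_TIMEOUT_MS),
    maxAttempts: explicit.maxAttempts ?? Math.min(attempts, HF_MAX_ATTEMPTS_CAP),
  };
}
```

Passing the environment as a plain record instead of reading process.env directly keeps the parsing logic testable without mutating global state.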

@vercel

vercel Bot commented May 7, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| gitnexus | Ready | Preview, Comment | May 8, 2026 7:11am |

Copilot AI linked an issue May 7, 2026 that may be closed by this pull request
…oad failure

When `gitnexus analyze --embeddings` fails because huggingface.co is
unreachable (e.g. the GFW, corporate proxies), the error was shown as a
raw `TypeError: fetch failed` with no actionable guidance.

Changes:
- `hf-env.ts`: add and export `isNetworkFetchError()` helper that
  detects network-level fetch errors (fetch failed, ECONNREFUSED,
  ENOTFOUND, ETIMEDOUT, ECONNRESET)
- `core/embeddings/embedder.ts`: in the device-fallback loop, detect
  network errors and rethrow immediately with a message telling the user
  to set HF_ENDPOINT to a mirror (hf-mirror.com) — device fallback is
  meaningless for network errors that will fail on every device
- `mcp/core/embedder.ts`: same fix for the MCP embedder entry point
- `cli/analyze.ts`: add a new error branch that detects fetch/network
  failures and prints a concrete 3-step remediation hint (HF_ENDPOINT,
  proxy/VPN, offline caching)
- `test/unit/hf-env.test.ts`: add 8 unit tests covering all five
  network error patterns and three negative cases

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/3e314b1c-ca74-44d5-9913-c1418d4e160a

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
…uplication

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/3e314b1c-ca74-44d5-9913-c1418d4e160a

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
@magyargergo

Collaborator

@copilot rebase on origin/main

Copilot AI changed the title from "[WIP] Fix load embedding failure on mac for GitNexus" to "fix: actionable HF_ENDPOINT guidance when embedding model download fails" May 7, 2026
Copilot AI requested a review from magyargergo May 7, 2026 19:57
…conflicts

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>

Copilot AI commented May 7, 2026

Contributor Author

@copilot rebase on origin/main

Done — merged origin/main (feat(core): adopt pino structured logger #1336) in commit 4f96ff8. Conflicts in analyze.ts and embedder.ts were resolved, aligning the new hf-endpoint-unreachable error branch with the cliError() pattern introduced in main.

@magyargergo

Collaborator

@copilot we also need proper retries and a timeout, with circuit breakers in place.

Copilot AI commented May 7, 2026

Contributor Author

@copilot we also need proper retries and a timeout, with circuit breakers in place.

Done in commit 681681a. Added to hf-env.ts:

  • HfDownloadCircuitBreaker — closed/open/half-open state machine; opens after 3 consecutive network failures, resets after 60 s (both configurable). A module-level singleton hfDownloadCircuit is shared by both embedder entry points.
  • withHfDownloadRetry(fn, opts) — wraps pipeline() with: a 5-minute per-attempt timeout (ETIMEDOUT → still caught by isNetworkFetchError), exponential-backoff retry (3 attempts, 2 s → 4 s delay), and circuit-breaker integration. When the circuit trips, callers get a fast-fail with the same HF_ENDPOINT remediation hint.
  • isHfDownloadFailure(msg) — combines isNetworkFetchError and circuit-open detection; used in both embedder catch blocks and cli/analyze.ts.

21 new tests cover the circuit state machine, timeout, retry exhaustion, circuit opening mid-run, onRetry callback, and non-network passthrough.
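
The closed/open/half-open behaviour described above can be illustrated with a small state machine. This is a sketch under stated assumptions, not the HfDownloadCircuitBreaker source: the class name, the derived `state` getter, and the injectable `now` clock are hypothetical, while the threshold (3 failures) and reset window (60 s) follow the comment.

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Hypothetical sketch of a consecutive-failure circuit breaker.
class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly resetMs: number;
  private readonly now: () => number;

  constructor(threshold = 3, resetMs = 60_000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.now = now; // injectable clock so tests need no real waiting
  }

  // State is derived from the time elapsed since the circuit opened.
  get state(): CircuitState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetMs ? "half-open" : "open";
  }

  isOpen(): boolean {
    return this.state === "open";
  }

  recordFailure(): void {
    if (this.state === "half-open") {
      // The probe attempt failed: reopen and restart the reset window.
      this.openedAt = this.now();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }

  recordSuccess(): void {
    // Any success closes the circuit and clears the failure count.
    this.failures = 0;
    this.openedAt = null;
  }
}
```

Because the clock is injected, the half-open → failure → reopen and half-open → success → closed paths can be driven by advancing a counter rather than sleeping.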

Copilot AI changed the title fix: actionable HF_ENDPOINT guidance when embedding model download fails fix: actionable HF_ENDPOINT guidance, retries, timeout and circuit breaker when embedding model download fails May 7, 2026
@github-actions

github-actions Bot commented May 7, 2026

Contributor

Claude finished @magyargergo's task in 5m 3s


PR #1419 Review — Adversarial Production Readiness

Task List

  • Post initial comment
  • Read repo context (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md)
  • Inspect changed files (hf-env.ts, embedder.ts × 2, analyze.ts, hf-env.test.ts)
  • Inspect related files (config.ts, types.ts, cli-message.ts, logger.ts, package.json)
  • Check CI/typecheck status
  • Hidden Unicode / bidi scan
  • Assess merge-conflict artifacts
  • Evaluate error classification correctness
  • Evaluate retry/timeout/circuit-breaker logic
  • Evaluate core + MCP embedder integration
  • Evaluate CLI UX
  • Evaluate test determinism
  • Evaluate worst-case delay / UX risk
  • Write final review

Review bar for this PR

  • isNetworkFetchError() must not classify CUDA/DirectML/WASM/ONNX device failures as network failures — device fallback must be preserved for non-network errors.
  • Network errors must short-circuit device fallback immediately with actionable HF_ENDPOINT guidance.
  • withHfDownloadRetry() must only retry network-level failures; non-network errors must propagate immediately without retry.
  • Circuit breaker must open after threshold, fail fast while open, and recover via half-open → closed.
  • Retry backoff must be bounded and tests must be deterministic (no real 5-minute waits).
  • pino logger and cliError() must be used consistently after the pino/cliError merge; no console.* regression or duplicate guidance noise.
  • Hidden Unicode controls must not affect executable code paths.
  • CLI and MCP embedder paths must be behaviourally aligned.

Current PR state

| Field | Value |
| --- | --- |
| Head SHA | 681681a |
| Status | Open (non-draft based on CI running) |
| Commits | 5 (Initial plan → final retry/circuit feature) |
| Changed files | 5 (hf-env.ts, embedder.ts, mcp/core/embedder.ts, analyze.ts, hf-env.test.ts) |
| CI | 6/8 complete, 0 failed. "CI" and "Docker Build & Push" in-progress. No failures visible. |
| Dependency review | ✅ passed (no new dependencies introduced) |
| Gitleaks | ✅ passed |
| CodeQL | ✅ passed |
| Hidden Unicode | ✅ No bidi/control characters. Non-ASCII chars are em-dashes in comments/strings and pre-existing emoji in logger.info calls; all are in human-readable content, not executable logic. |
| Vercel | Irrelevant to this embedding/CLI/MCP reliability review. |

Branch hygiene / merge-conflict assessment

Clean. The conflict from the pino/cliError() merge (4f96ff8) was resolved correctly:

  • analyze.ts — single clean import of isHfDownloadFailure from hf-env.js (line 30); no duplicate imports. cliError() is used consistently throughout; no old console.error mixing. process.exitCode = 1 appears exactly once.
  • embedder.ts — single import of { applyHfEnvOverrides, isHfDownloadFailure, withHfDownloadRetry } (line 25). No duplicate applyHfEnvOverrides. pipeline() call is wrapped exactly once per device attempt. Pino logger.warn({ attempt, max, err: err.message }, ...) usage is correct.
  • No dangling braces, no duplicate });, no leftover merge-conflict comments, no any-cast escape hatches added for typecheck workarounds.

Understanding of the change

Why TypeError: fetch failed was poor UX: The raw undici error message gave no indication that HF_ENDPOINT could reroute the download, and silently continued into a meaningless device-fallback loop (trying CUDA → CPU → WASM for what was purely a network problem).

Why HF_ENDPOINT is the right hint: transformers.js reads env.remoteHost, not HF_ENDPOINT directly. The existing applyHfEnvOverrides() bridge already maps the standard env var into transformers.js — surfacing it in the failure path closes the documentation gap.

Why network errors must skip device fallback: The model file cannot be downloaded on any device if the network is blocked. Falling back from CUDA to CPU to WASM wastes time and produces multiple identical failures.

Why retry/timeout/circuit breaker helps: Transient jitter (packet loss, flaky CDN edge) benefits from a couple of retries with backoff. A per-attempt timeout prevents a stalled partial download from hanging forever. The circuit breaker prevents an MCP long-running process from repeatedly hammering an unreachable endpoint.


Findings

[medium] Duplicate guidance output for HF download failures

  • Category: CLI UX
  • Files: gitnexus/src/cli/analyze.ts:586, 633–644
  • Issue: For HF download failures, writeFatalToStderr('Analysis failed', err) on line 586 is always called first and prints the full error message, which already contains the embedded HF_ENDPOINT hint added by the embedder. The cliError(...) block on line 634 then prints a second, differently-worded remediation guide. The user sees: (1) raw error text + stack trace with embedded hint, (2) the clean 3-step cliError block. Compare with RegistryNameCollisionError and AnalysisNotFinalizedError which use early return to avoid this double-output.
  • Why it matters: Noisy output reduces clarity. The clean cliError guidance is good; the preceding raw-error-plus-stack-trace makes it hard to find. Users on slow terminals may not scroll to see the cliError block.
  • Recommended fix: Move the HF case to a guarded block before writeFatalToStderr (matching the pattern of RegistryNameCollisionError), or suppress the stack trace for known-expected download failures.
  • Blocks merge: no (consistent with existing behavior for heap/module errors which have the same pattern; the guidance is still shown)
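
The recommended early-return ordering can be sketched as below. This is only an illustration of the control flow: `reportAnalyzeFailure`, `CliOutput`, and the injected predicate are hypothetical stand-ins, and the real cliError() and writeFatalToStderr signatures in analyze.ts are not reproduced here.

```typescript
// Hypothetical output sink standing in for cliError() and writeFatalToStderr().
interface CliOutput {
  guidance: string[]; // clean remediation messages
  fatal: string[]; // raw error text with stack trace
}

// Hypothetical handler: expected download failures are reported once and return early,
// so the generic fatal writer never prints a stack trace for them.
function reportAnalyzeFailure(
  err: Error,
  isExpectedDownloadFailure: (message: string) => boolean,
  out: CliOutput,
): void {
  if (isExpectedDownloadFailure(err.message)) {
    out.guidance.push(
      "The embedding model could not be downloaded. Set HF_ENDPOINT to a mirror and retry.",
    );
    return; // early return: no duplicate output
  }
  out.fatal.push(err.stack ?? err.message);
}
```

With this ordering a network failure produces exactly one guidance block, while unexpected errors still get the full fatal output.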

[medium] Worst-case failure delay ≈ 15 minutes for slow/partial connections

  • Category: Performance / UX
  • Files: gitnexus/src/core/embeddings/hf-env.ts:9,11
  • Issue: HF_MAX_ATTEMPTS=3 × HF_DOWNLOAD_TIMEOUT_MS=5min + 2s + 4s backoff ≈ 15 minutes 6 seconds worst case. For connection-refused / DNS failure this is not an issue (fast rejection, total ~6 seconds). But for partially-open connections where data trickles then stalls, a user behind a transparent proxy or CDN with broken range-request support could wait 15 minutes before seeing the actionable error message.
  • Why it matters: CLI UX. An MCP server requiring embeddings could block tool calls for 15 minutes before surfacing actionable guidance. No env override exists to reduce per-attempt timeout or total attempts.
  • Recommended fix: Expose HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS as env-var overrides (or at minimum document them in the error message). Consider reducing the default timeout to 2–3 minutes for the first attempt when no HF_ENDPOINT is set. Add a total-timeout bound as a follow-up.
  • Blocks merge: no (for the DNS/connection-refused common case, actual wait is ~6 seconds; partial-connection is an edge case)
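
The worst-case arithmetic in this finding can be checked with a short calculation. The function name `worstCaseDelayMs` is hypothetical; the inputs (3 attempts, 5-minute per-attempt timeout, 2 s base delay doubling per retry) come from the finding itself.

```typescript
// Hypothetical helper: total wait if every attempt runs to its timeout,
// plus the exponential backoff slept between attempts (none after the last).
function worstCaseDelayMs(maxAttempts: number, timeoutMs: number, baseDelayMs: number): number {
  let total = maxAttempts * timeoutMs;
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    total += baseDelayMs * 2 ** attempt;
  }
  return total;
}

// Stalled connection: 3 x 5 min + 2 s + 4 s = 906 000 ms (15 min 6 s).
const stalledMs = worstCaseDelayMs(3, 5 * 60 * 1000, 2000);

// Immediate rejection (ECONNREFUSED, ENOTFOUND): only the backoff remains, 6 000 ms.
const fastFailMs = worstCaseDelayMs(3, 0, 2000);
```

The per-attempt timeout dominates the total, which is why lowering it has far more effect than shortening the backoff.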

[minor] withDownloadTimeout test uses real 20ms timer without fake timers

  • Category: Test determinism
  • Files: gitnexus/test/unit/hf-env.test.ts:210–213
  • Issue: withDownloadTimeout(neverResolves, 20) waits for a real 20ms setTimeout. Under CI load or slow workers, this could occasionally flake or extend test runtime unnecessarily.
  • Recommended fix: Wrap in vi.useFakeTimers() / vi.advanceTimersByTime(30) consistent with the HfDownloadCircuitBreaker reset test.
  • Blocks merge: no

[minor] Missing half-open → failure → reopen and half-open → success (via retry) tests

  • Category: Test coverage
  • Files: gitnexus/test/unit/hf-env.test.ts
  • Issue: The HfDownloadCircuitBreaker tests cover: closed → open, open → closed (via recordSuccess()), and open → half-open (via fake timer). But two scenarios are not directly tested: (a) half-open failure re-opens the circuit; (b) a successful withHfDownloadRetry call from half-open state closes the circuit and resets _failures. Without these tests, a regression in the state getter's half-open logic could go undetected.
  • Recommended fix: Add two tests: one calling cb.recordFailure() from half-open and asserting isOpen() === true; one calling withHfDownloadRetry from half-open with a succeeding fn and asserting state === 'closed'.
  • Blocks merge: no

[minor] Windows env-var syntax in CLI guidance and embedder hint

  • Category: CLI UX
  • Files: gitnexus/src/cli/analyze.ts:639, gitnexus/src/core/embeddings/embedder.ts:252
  • Issue: HF_ENDPOINT=https://hf-mirror.com npx gitnexus analyze --embeddings uses Unix VAR=value cmd syntax that does not work in Windows cmd.exe or PowerShell. Given GitNexus supports Windows (DirectML device, DML fallback code), users on Windows will get actionable guidance they cannot directly execute.
  • Recommended fix: Add a note: (Windows: set HF_ENDPOINT=https://hf-mirror.com && npx gitnexus analyze --embeddings) or document in the error text. Low priority if Windows users are expected to use WSL.
  • Blocks merge: no

[cosmetic] Redundant msg.includes('Failed to download embedding model') fallback in analyze.ts

  • Category: Code clarity
  • Files: gitnexus/src/cli/analyze.ts:633
  • Issue: isHfDownloadFailure(msg) already matches because the embedder always embeds the original errMsg (which contains fetch failed, ECONNREFUSED, or hf-circuit-open) into the wrapper exception message. The msg.includes('Failed to download embedding model') branch is never reached independently.
  • Why it matters: Cosmetic only; no behavioral impact. The fallback does serve as a safety net if future callers produce this prefix without a matched network pattern.
  • Blocks merge: no

HF endpoint / error classification assessment

applyHfEnvOverrides(): correct. It maps HF_ENDPOINT → env.remoteHost with trailing-slash normalization. Whitespace-only values are silently ignored via a .trim() guard. It is called before pipeline() in both the core and MCP embedders. No credentials are printed (the hint uses a hardcoded mirror URL, not the user's HF_ENDPOINT value).
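
A minimal sketch of the bridge described above, for illustration only. `applyEndpointOverrideSketch` and `RemoteHostConfig` are hypothetical names; the sketch assumes that normalization means ensuring exactly one trailing slash and that the value is trimmed before use, neither of which is confirmed by the review text.

```typescript
// Hypothetical stand-in for the part of the transformers.js env object that matters here.
interface RemoteHostConfig {
  remoteHost: string;
}

// Hypothetical bridge: copy HF_ENDPOINT into remoteHost, ignoring blank values.
function applyEndpointOverrideSketch(
  config: RemoteHostConfig,
  vars: Record<string, string | undefined>,
): void {
  const endpoint = vars["HF_ENDPOINT"]?.trim();
  if (!endpoint) return; // unset or whitespace-only: keep the default host
  // Assumption: the host is stored with a single trailing slash.
  config.remoteHost = endpoint.endsWith("/") ? endpoint : `${endpoint}/`;
}
```

Taking the variables as a parameter rather than reading process.env keeps the sketch free of global state.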

isNetworkFetchError() — correct for the intended cases. Catches all five target patterns case-sensitively (these are Node.js errno constants, always uppercase). Does not match CUDA, DirectML, WASM, ONNX, Cannot find module, or any device/filesystem pattern. Negative test cases confirm this.
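
As an illustration of the classification described above, a case-sensitive substring match over the five patterns is enough to separate network failures from device failures. This is a sketch, not the hf-env.ts source; the function and constant names are hypothetical.

```typescript
// The five patterns named in the PR description.
const NETWORK_ERROR_PATTERNS: readonly string[] = [
  "fetch failed",
  "ECONNREFUSED",
  "ENOTFOUND",
  "ETIMEDOUT",
  "ECONNRESET",
];

// Hypothetical classifier: true when the message contains any network-level pattern.
// String.prototype.includes is case-sensitive, matching the uppercase errno constants.
function isNetworkFetchErrorSketch(message: string): boolean {
  return NETWORK_ERROR_PATTERNS.some((pattern) => message.includes(pattern));
}
```

Device errors such as a CUDA initialization failure contain none of these substrings, so they continue to flow into the device-fallback loop.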

isHfCircuitOpenError() — uses a unique tag string 'hf-circuit-open'. Spoofing risk is low — arbitrary user-controlled strings don't flow into err.message at this code path (the error is synthesized internally by withHfDownloadRetry).

isHfDownloadFailure() — correctly combines both guards. The analyze.ts check isHfDownloadFailure(msg) || msg.includes('Failed to download embedding model') adds a belt-and-suspenders fallback for the wrapper prefix.

Credential safety: process.env.HF_ENDPOINT is printed directly in the embedder's guidance hint (The configured endpoint (${process.env.HF_ENDPOINT}) may be unreachable.). Custom corporate endpoints with embedded auth tokens in the URL would be exposed in error output. This is an accepted minor risk for mirror URLs, but worth documenting.


Retry / timeout / circuit breaker assessment

withDownloadTimeout() — uses manual fn().then(resolve, reject) + clearTimeout pattern (not Promise.race). This correctly clears the timer on both success and failure paths. A late fn() resolution after timeout fires cannot call circuit.recordSuccess() — the outer Promise is already settled; late resolve(v) is a no-op. The underlying pipeline() download cannot be cancelled (Transformers.js has no AbortController support) and continues in the background, consuming network bandwidth after the timeout reject. This is an accepted limitation; documenting it in the code comment would be useful.
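
The timer-clearing pattern described above might look like the following. This is a sketch assuming only what the paragraph states (manual then(resolve, reject), clearTimeout on both paths, an ETIMEDOUT message on expiry); `withTimeoutSketch` is a hypothetical name and the exact error text is an assumption.

```typescript
// Hypothetical timeout wrapper: rejects with an ETIMEDOUT-tagged error after `ms`,
// otherwise settles with the result of `fn` and clears the timer.
function withTimeoutSketch<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      // The message contains ETIMEDOUT so a network-error classifier still matches it.
      reject(new Error(`ETIMEDOUT: download exceeded ${ms} ms`));
    }, ms);

    // Promise.resolve().then(fn) also routes a synchronous throw from fn into the reject path.
    Promise.resolve()
      .then(fn)
      .then(
        (value) => {
          clearTimeout(timer);
          resolve(value); // a no-op if the timeout already rejected
        },
        (err) => {
          clearTimeout(timer);
          reject(err);
        },
      );
  });
}
```

Once the outer promise has settled, a late resolve or reject from `fn` has no effect, although the underlying work keeps running because nothing cancels it.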

Backoff sequence: baseDelayMs * 2^attempt gives 2000 ms for attempt 0 and 4000 ms for attempt 1, which matches the claimed 2 s → 4 s. Bounded. ✓

Non-network passthrough: isNetworkFetchError(lastError.message) gates retries. Non-network errors (CUDA init, ONNX, module not found) throw immediately without retry and without circuit.recordFailure(). ✓
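
The backoff and passthrough behaviour can be sketched as a small retry loop. This is an illustration, not withHfDownloadRetry itself: `retrySketch`, its options, and the injectable `sleep` are hypothetical, and the per-attempt timeout and circuit-breaker recording are deliberately left out.

```typescript
interface RetrySketchOptions {
  maxAttempts: number;
  baseDelayMs: number;
  isRetryable: (message: string) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests avoid real delays
}

// Delay before the retry that follows a failed attempt (0-based): base * 2^attempt.
function backoffDelayMs(baseDelayMs: number, attempt: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Hypothetical retry loop: retries only errors the predicate accepts.
async function retrySketch<T>(fn: () => Promise<T>, opts: RetrySketchOptions): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  let lastError: Error = new Error("no attempts made");
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      // Non-retryable errors (for example device failures) propagate immediately.
      if (!opts.isRetryable(lastError.message)) throw lastError;
      // No sleep after the final attempt.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelayMs(opts.baseDelayMs, attempt));
      }
    }
  }
  throw lastError;
}
```

With three attempts and a 2000 ms base, the loop sleeps 2000 ms and then 4000 ms before giving up.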

Circuit opens mid-run — after circuit.recordFailure() on attempt N, circuit.isOpen() is checked. If the threshold is reached, CIRCUIT_OPEN_TAG error is thrown immediately (no sleep, no further attempts). ✓

Singleton state — the module-level hfDownloadCircuit singleton is acceptable: for CLI (short-lived), it only affects one invocation; for MCP (long-running), it correctly prevents repeated hammering. Tests use fresh HfDownloadCircuitBreaker instances — no singleton leakage between tests. ✓

Worst-case delay — as noted above, 15 minutes for slow partial connections. For the common cases (ENOTFOUND, ECONNREFUSED), actual wait is ~6 seconds total. Acceptable for those cases.


Core embedder assessment

  • pipeline() is wrapped exactly once per device attempt in withHfDownloadRetry(). ✓
  • applyHfEnvOverrides(env) called before the device loop. ✓
  • All pipeline options (modelId, device, dtype, progress_callback, session_options) unchanged. ✓
  • Device fallback loop preserved for non-network errors: if CUDA fails with Cannot initialize CUDA, the loop continues to cpu. ✓
  • Network errors short-circuit via isHfDownloadFailure(errMsg) check in the device catch block — rethrows immediately with actionable message. ✓
  • currentDevice assignment happens only after successful pipeline init (line 228). ✓
  • logger.warn({ attempt, max, err: err.message }, '...') — correct pino shape for objects + message string. ✓
  • HTTP embedding backend (isHttpMode()) is unaffected (checked before any device setup). ✓
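
The interaction between device fallback and the network short-circuit can be sketched as follows. This is an illustration only: `initWithDeviceFallback` and its parameters are hypothetical, `load` stands in for the wrapped pipeline() call, and the wrapped error text is abbreviated from the message quoted in the PR description.

```typescript
// Hypothetical device-fallback loop with a short-circuit for download failures.
async function initWithDeviceFallback<T>(
  devices: string[],
  load: (device: string) => Promise<T>,
  isDownloadFailure: (message: string) => boolean,
): Promise<{ device: string; value: T }> {
  let lastError: Error | undefined;
  for (const device of devices) {
    try {
      // The device is only reported after a successful load.
      return { device, value: await load(device) };
    } catch (err) {
      const e = err instanceof Error ? err : new Error(String(err));
      if (isDownloadFailure(e.message)) {
        // A blocked network fails on every device, so stop and surface guidance now.
        throw new Error(
          `Failed to download embedding model: ${e.message}\n  Set HF_ENDPOINT to a mirror and retry.`,
        );
      }
      lastError = e; // device-level failure: try the next device
    }
  }
  throw lastError ?? new Error("no devices configured");
}
```

A CUDA initialization error falls through to the next device, while a fetch failure ends the loop after a single attempt.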

MCP embedder assessment

  • Same withHfDownloadRetry wrapper with shared hfDownloadCircuit singleton. ✓
  • Same applyHfEnvOverrides(env) + isHfDownloadFailure(errMsg) + early rethrow pattern. ✓
  • silenceStdout() + process.stderr.write = (() => true) wrap the entire withHfDownloadRetry call. The finally block correctly restores both. This means: during all retry attempts, stdout/stderr are suppressed. For MCP, this is correct (ONNX/transformers.js progress text must not corrupt stdio protocol). However, if the retry cycle is long (up to 15 minutes for slow connections), stdout/stderr are suppressed for that entire window. This is unlikely to cause observable issues in practice but is worth noting.
  • Errors thrown from MCP embedder propagate up to embedQuery() callers, which surface them as MCP tool errors. The thrown message includes the actionable HF guidance — useful for MCP client logs. ✓
  • Pino logger (logger) writes to fd 2 (stderr) via a buffered SonicBoom destination. Since process.stderr.write is temporarily replaced during model load, pino records emitted during that window are silenced. After restoreStdout(), normal stderr behavior resumes. ✓
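
The suppress-then-restore pattern noted for the MCP embedder can be sketched generically. This is not the silenceStdout()/restoreStdout() implementation: `withSilencedWrites` and the `WritableLike` interface are hypothetical, and real Node streams have a wider write() signature than the one modelled here.

```typescript
// Hypothetical minimal stream shape; real process.stdout/stderr accept more overloads.
interface WritableLike {
  write: (chunk: string) => boolean;
}

// Hypothetical helper: replace write() with a no-op while fn runs, then restore it
// in finally so the streams recover even when fn throws.
async function withSilencedWrites<T>(streams: WritableLike[], fn: () => Promise<T>): Promise<T> {
  const saved = streams.map((stream) => ({ stream, write: stream.write }));
  for (const { stream } of saved) {
    stream.write = () => true; // swallow output for the duration of fn
  }
  try {
    return await fn();
  } finally {
    for (const { stream, write } of saved) {
      stream.write = write;
    }
  }
}
```

Because the restore happens in finally, output is suppressed for the whole retry cycle and no longer, whether the download succeeds or fails.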

CLI error UX assessment

isHfDownloadFailure branch order in analyze.ts is correct — it is checked after the RegistryNameCollisionError and AnalysisNotFinalizedError class checks, and before the generic fallthrough. ✓

The cliError() remediation message is clear and actionable:

  The embedding model could not be downloaded.
  huggingface.co may be unreachable from your network
  Suggestions:
    1. Set HF_ENDPOINT to a mirror and retry:
         HF_ENDPOINT=https://hf-mirror.com npx gitnexus analyze --embeddings
    2. Check your proxy / VPN settings.
    3. Once downloaded the model is cached — future runs work offline.

recoveryHint: 'hf-endpoint-unreachable' is stable and correctly placed. ✓

Duplicate output concern: as noted in Findings #1, writeFatalToStderr runs unconditionally and the HF exception message already contains inline guidance. The clean cliError block follows, creating noise. Not a blocker.

process.exitCode = 1 is set once at the bottom of the catch block (not inside the if/else-if chain), which is correct. ✓

Raw stack trace IS printed for HF errors (via writeFatalToStderr). This is consistent with heap/module errors but more noisy than ideal for an expected network failure.


Test assessment

Coverage:

| Scenario | Covered? |
| --- | --- |
| All 5 network error patterns | ✅ |
| Non-network negatives (CUDA, empty, module-not-found) | ✅ |
| withDownloadTimeout success | ✅ |
| withDownloadTimeout timeout rejects ETIMEDOUT | ✅ (real timer, minor flake risk) |
| Timeout propagates non-timeout errors | ✅ |
| Retry success after transient failure | ✅ |
| Retry exhaustion throws last network error | ✅ |
| Non-network error no retry | ✅ |
| Circuit opens after threshold | ✅ |
| Circuit fails fast while open | ✅ |
| Circuit half-open after reset timeout | ✅ |
| Half-open → failure → reopens | ❌ missing |
| Half-open → success closes circuit | ❌ missing (closes immediately but half-open via retry path not tested) |
| onRetry callback args | ✅ |
| Circuit reset between tests | ✅ (fresh instances used) |
| Singleton not leaked between tests | ✅ |
| isHfCircuitOpenError + isHfDownloadFailure | ✅ |

Fake timers — used correctly for the half-open transition test (try/finally pattern). ✓

Backoff tests — use baseDelayMs: 0 to avoid real delays. Deterministic. ✓

Missing integration tests — no mocked pipeline() tests covering: device fallback preserved for device errors; network error skips device fallback; CLI analyze printing guidance. Desirable follow-ups, not blockers.


Performance / UX assessment

Worst case (slow partial connection): 15 min 6 s, which is too long to wait for an actionable error. A completely blocked endpoint does not hit this path, because connection-level errors (ENOTFOUND, ECONNREFUSED) fire immediately and the total wait is about 6 seconds, which is acceptable. Recommended follow-up: add an HF_DOWNLOAD_TIMEOUT_MS env override to let users reduce the per-attempt timeout.

Cached/offline behavior: applyHfEnvOverrides sets env.cacheDir. If the model is already cached at that path, Transformers.js loads from cache without network access. No retry penalty for cached models. ✓

Retry noise: onRetry logging only fires in NODE_ENV=development. Production users do not see retry-attempt logs. ✓


Hidden Unicode / hygiene assessment

No bidi/directional control characters (U+202A–U+202E, U+2066–U+2069) found in any of the 5 changed files.

Non-ASCII characters found:

  • Em-dashes (U+2014): in JSDoc comments and string literals (e.g. "— will reset in ~${secsUntilReset}s"). Harmless typographic characters in human-readable output.
  • Emojis (including 🧠, 🔧, ⚠️) in logger.info calls: these are pre-existing in embedder.ts, not introduced by this PR.
  • Arrows (e.g. →) in comments and analyze.ts string literals: pre-existing.

GitHub's "hidden Unicode" warning for this PR refers to these benign characters. No executable code logic is affected by Unicode in conditions, identifiers, or control flow. Not a blocker.

No generated files, no package.json changes, no package-lock churn. ✓


Elegance / maintainability assessment

hf-env.ts is cohesive: all HF download resilience helpers in one file. Constants are named and documented with JSDoc. withHfDownloadRetry API is small (fn + optional overrides). The HfRetryOptions.circuit override for test injection is the right pattern.

Retry/circuit logic is not duplicated between core and MCP embedder — both import from hf-env.ts. Error classification is centralized. ✓

Comments in embedder catch blocks correctly explain WHY device fallback is skipped for network errors — this is non-obvious reasoning, well-documented. ✓

No unrelated embedding behavior changed. HTTP backend path unchanged. ✓


Final verdict

production-ready with minor follow-ups

The core correctness properties all hold: network errors are classified correctly and do not misclassify device failures; pipeline() is wrapped exactly once per attempt; device fallback is correctly preserved for ONNX/CUDA errors and correctly skipped for network errors; circuit breaker state transitions are implemented correctly; CLI and MCP embedder are behaviourally aligned; the pino/cliError merge is clean with no conflict artifacts.

The two medium findings — duplicate guidance output (the writeFatalToStderr before the cliError HF block) and the 15-minute worst-case wait for slow partial connections — are real but follow the existing codebase pattern for other error types and only affect slow-connection edge cases respectively. The missing half-open tests and real-timer test are minor hygiene gaps.

Follow-ups recommended before 1.0 or high-traffic release: (1) move HF error handling before writeFatalToStderr to eliminate duplicate output and stack-trace noise, (2) expose HF_DOWNLOAD_TIMEOUT_MS as an env override, (3) add half-open state machine tests, (4) add Windows env-var syntax note.

@magyargergo

Collaborator

@copilot tackle all of the findings above, and make sure npx prettier --write . is run and committed to this branch!

@github-actions

github-actions Bot commented May 8, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
| --- | --- | --- |
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
| --- | --- | --- | --- | --- |
| 8214 | 8213 | 0 | 1 | 378s |

✅ All 8213 tests passed

1 test(s) skipped:
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 77.8% 24987/32114 N/A% 🟢 ███████████████░░░░░
Branches 66.36% 15805/23816 N/A% 🟢 █████████████░░░░░░░
Functions 82.95% 2501/3015 N/A% 🟢 ████████████████░░░░
Lines 80.86% 22583/27927 N/A% 🟢 ████████████████░░░░


Repository owner deleted a comment from github-actions Bot May 8, 2026
@github-actions

github-actions Bot commented May 8, 2026

Contributor

Claude finished @magyargergo's task in 6m 8s


PR #1419 Review — Adversarial Production Readiness

Task List

  • Post initial comment
  • Read repo context (AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, ARCHITECTURE.md)
  • Inspect changed files (hf-env.ts, embedder.ts × 2, analyze.ts, hf-env.test.ts)
  • Inspect related files (config.ts, types.ts, cli-message.ts, logger.ts)
  • Check git log / head SHA / CI status
  • Hidden Unicode / bidi scan
  • Assess merge-conflict artifacts
  • Evaluate error classification correctness
  • Evaluate retry/timeout/circuit-breaker logic
  • Evaluate core + MCP embedder integration
  • Evaluate CLI UX
  • Evaluate test determinism + coverage
  • Evaluate env override parsing safety
  • Write final review

Review bar for this PR

  • isNetworkFetchError() must correctly classify fetch failed / ECONNREFUSED / ENOTFOUND / ETIMEDOUT / ECONNRESET without misclassifying CUDA / WASM / ONNX device failures — device fallback must be preserved for non-network errors.
  • withHfDownloadRetry() must retry only network errors; non-network errors propagate immediately.
  • Circuit breaker must open after threshold, fail fast while open, recover via half-open → closed; half-open failure must reopen.
  • Retry backoff must be bounded; tests must be deterministic (no real timers in timeout/circuit tests).
  • HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS env overrides must validate safely and be tested.
  • cliError() must be used before writeFatalToStderr for expected HF failures; no raw stack trace for network download errors.
  • pino logger used consistently; no console.* regression.
  • Hidden Unicode/bidi controls must not appear in executable code.

Current PR state

| Field | Value |
| --- | --- |
| Head SHA | 3c160c88f038a918653ba400aa7ae87a494a862f |
| Status | Open (non-draft) |
| Commits | 10 (includes two merge-forward commits from main) |
| Changed files | 5 (hf-env.ts, embedder.ts, mcp/core/embedder.ts, analyze.ts, hf-env.test.ts) |
| CI | ✅ All checks passed (typecheck ✓, unit tests 8206/8206 ✓, e2e ✓) |
| Typecheck | tsc --noEmit success |
| hf-env.test.ts | ✅ 38 tests |
| Package state | No new dependencies, no package-lock churn |
| Hidden Unicode | ✅ Zero bidi/directional controls (U+202A–202E, U+2066–2069). Non-ASCII characters are em-dashes (U+2014) in JSDoc comments, which are benign. |
| Vercel | Irrelevant to this embedding/CLI/MCP reliability review |
| Prior findings | All 5 findings from the prior review are confirmed fixed on this HEAD |

Branch hygiene / merge-conflict assessment

Clean. Two merge-forward commits (4f96ff8, 3c160c8) resolved conflicts from the pino/cliError() merge correctly.

  • core/embeddings/embedder.ts — single clean import block (line 25: { applyHfEnvOverrides, isHfDownloadFailure, withHfDownloadRetry }). No duplicate imports. No console.* usage. pipeline() wrapped exactly once in withHfDownloadRetry() per device attempt. applyHfEnvOverrides(env) called before the device loop. Pino logger.warn({ attempt, max, err: err.message }, '...') correctly shaped.
  • mcp/core/embedder.ts — single import block lines 9–19. pipeline() wrapped exactly once. silenceStdout() + stderr suppression in the correct inner try/finally around withHfDownloadRetry. No duplicate applyHfEnvOverrides calls.
  • analyze.ts — single import of isHfDownloadFailure at line 30. process.exitCode = 1 set correctly. No console.error mixed with pino. No dangling braces or duplicate });.

No any-cast escape hatches introduced. No stale conflict comments.


Understanding of the change

TypeError: fetch failed was poor UX because it gave no indication that HF_ENDPOINT could reroute the download, and silently continued into a device-fallback loop (CUDA → CPU → WASM) that was meaningless for a network-level failure. HF_ENDPOINT is the right hint because applyHfEnvOverrides() already maps it to env.remoteHost, which transformers.js actually reads. Network errors should not trigger device fallback because the model file cannot be fetched on any device if the network is blocked. Retry/timeout/circuit-breaker helps transient jitter (packet loss, CDN edge resets) while bounding the worst-case wait and preventing MCP servers from hammering an unreachable endpoint.


Findings

[medium] No unit tests for HF_DOWNLOAD_TIMEOUT_MS / HF_MAX_ATTEMPTS env override parsing

  • Category: Test coverage / DoD
  • Files: gitnexus/test/unit/hf-env.test.ts, gitnexus/src/core/embeddings/hf-env.ts:271–276
  • Issue: The PR exposes HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS as documented, user-facing env overrides. The env override validation logic (Number.isFinite(v) && v > 0) is correct, but there are zero tests for: valid values being read and applied, invalid values (zero, negative, NaN, non-numeric strings like "abc", Infinity) falling back to defaults, and fractional values ("0.5" passes > 0 guard and is accepted as-is). The withHfDownloadRetry test suite uses explicit maxAttempts / timeoutMs option overrides exclusively — the env var path is never exercised.
  • Why it matters: The PR description explicitly lists these overrides as a key deliverable (HF_DOWNLOAD_TIMEOUT_MS=60000 npx gitnexus analyze --embeddings). The DoD bar requires env override tests. A future refactor that accidentally breaks env var reading would not be caught.
  • Recommended fix: Add a describe('withHfDownloadRetry env overrides', ...) block covering: valid HF_DOWNLOAD_TIMEOUT_MS is used as timeoutMs; HF_MAX_ATTEMPTS=1 gives exactly 1 attempt; HF_MAX_ATTEMPTS=abc falls back to default 3; HF_MAX_ATTEMPTS=0 falls back to default; HF_DOWNLOAD_TIMEOUT_MS=-1 falls back to default. Use process.env set/restore in beforeEach/afterEach. Fix this →
  • Blocks merge: No (logic is correct; tests are missing but runtime behavior is safe)
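The cases named in the recommended fix can be tabulated against the guard quoted in the finding. The sketch below is a stand-in, not the code in hf-env.ts: parsePositive and DEFAULT_MAX_ATTEMPTS are hypothetical names, and parsing with Number() is an assumption. Only the Number.isFinite(v) && v > 0 check and the default of 3 come from the review.

```typescript
// Hypothetical stand-in for the env parsing described in the finding.
// Only the guard `Number.isFinite(v) && v > 0` is taken from the review.
const DEFAULT_MAX_ATTEMPTS = 3; // default attempt count stated in the PR description

function parsePositive(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const v = Number(raw); // assumption: the real code may parse differently
  return Number.isFinite(v) && v > 0 ? v : fallback;
}

// Inputs a test block would exercise, paired with the value the guard yields.
const guardCases: Array<[string | undefined, number]> = [
  ["1", 1], // valid: exactly one attempt
  ["abc", 3], // Number("abc") is NaN, so the default applies
  ["0", 3], // not greater than zero, so the default applies
  ["-1", 3], // negative, so the default applies
  ["Infinity", 3], // not finite, so the default applies
  ["0.5", 0.5], // fractional value passes the guard unchanged
  [undefined, 3], // unset, so the default applies
];

for (const [raw, expected] of guardCases) {
  const got = parsePositive(raw, DEFAULT_MAX_ATTEMPTS);
  console.log(`${String(raw)} -> ${got} (expected ${expected})`);
}
```

In a real test file each row would set process.env in beforeEach and restore it in afterEach, as the finding suggests, rather than passing the raw string directly.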

[minor] No upper bound on env override values

  • Category: Defensive validation
  • Files: gitnexus/src/core/embeddings/hf-env.ts:273–276
  • Issue: HF_DOWNLOAD_TIMEOUT_MS=999999999 (277 hours) and HF_MAX_ATTEMPTS=9999 are silently accepted. The guard only checks > 0 and Number.isFinite. A user who sets HF_MAX_ATTEMPTS=9999 gets 9999 × 5 min ≈ 34 days of potential retry. Fractional values also pass: HF_DOWNLOAD_TIMEOUT_MS=0.5 is accepted as a 0.5ms timeout (effectively immediate failure on every attempt).
  • Why it matters: Requires explicit misconfiguration, but a reasonable upper bound (e.g., maxAttempts ≤ 10, timeoutMs ≤ 30 * 60 * 1000) would prevent accidental runaway waits.
  • Recommended fix: Clamp with Math.min(resolvedMaxAttempts, 10) and Math.min(resolvedTimeout, 30 * 60 * 1000). Also consider Math.floor for resolvedMaxAttempts to reject fractional values cleanly.
  • Blocks merge: No (requires explicit user misconfiguration; not a silent failure path)
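A minimal sketch of the clamping this finding recommends, assuming the caps suggested above (10 attempts, 30 minutes). The constant and function names are illustrative, and the Math.max(1, ...) floor is an extra precaution added here so that a value such as 0.5 cannot round down to zero attempts; the review itself only asks for Math.min and Math.floor.

```typescript
// Caps suggested in the finding; names are illustrative, not taken from hf-env.ts.
const SKETCH_MAX_ATTEMPTS_CAP = 10;
const SKETCH_MAX_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes

function clampAttempts(v: number): number {
  // Math.floor turns 2.9 into 2; Math.max keeps the result at one attempt or more.
  return Math.min(Math.max(1, Math.floor(v)), SKETCH_MAX_ATTEMPTS_CAP);
}

function clampTimeoutMs(ms: number): number {
  return Math.min(ms, SKETCH_MAX_TIMEOUT_MS);
}

console.log(clampAttempts(9999)); // 10
console.log(clampAttempts(2.9)); // 2
console.log(clampTimeoutMs(999_999_999)); // 1800000
```

Both helpers expect a value that has already passed the positive, finite guard, so they only need to handle the upper bound and fractions.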

[cosmetic] Belt-and-suspenders msg.includes('Failed to download embedding model') in analyze.ts

  • Category: Code clarity
  • Files: gitnexus/src/cli/analyze.ts:583
  • Issue: isHfDownloadFailure(msg) || msg.includes('Failed to download embedding model') — the second branch is never reached independently because the embedder always embeds the original network-error string (which matches isNetworkFetchError) into the wrapper message. The fallback has no behavioral impact and serves as a safety net for future callers.
  • Blocks merge: No

HF endpoint / error classification assessment

applyHfEnvOverrides(): correct. HF_ENDPOINT → env.remoteHost with trailing slash normalization. Whitespace-only values are silently treated as unset via .trim() guard. env.cacheDir defaults to ~/.cache/huggingface. No credentials printed (hint uses hardcoded mirror URL, not user's HF_ENDPOINT). Called before pipeline() in both core and MCP embedder. ✓

Credential risk: process.env.HF_ENDPOINT is printed in the embedder's guidance hint when set (The configured endpoint (${process.env.HF_ENDPOINT}) may be unreachable.). Corporate endpoints with auth tokens in the URL would be exposed in error output. This is an accepted risk for mirror URLs and is consistent with the prior review's assessment.
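If an endpoint ever did carry credentials, the hint could print a redacted form instead. The sketch below is not something this PR does: redactEndpoint is a hypothetical helper built on the standard URL class, and it only strips userinfo and the query string, so a token embedded in the path would still be visible.

```typescript
// Hypothetical helper: strips user:password@ and ?query from a URL before logging it.
function redactEndpoint(raw: string): string {
  try {
    const url = new URL(raw);
    url.username = "";
    url.password = "";
    url.search = "";
    return url.toString();
  } catch {
    // Not a parseable URL: avoid echoing the raw value at all.
    return "<invalid HF_ENDPOINT>";
  }
}

console.log(redactEndpoint("https://user:secret@mirror.example.com/hf?token=abc"));
// https://mirror.example.com/hf
```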

isNetworkFetchError() — correct. Case-sensitive matches on Node.js errno constants (always uppercase). Does NOT match CUDA, DirectML, WASM, ONNX, Cannot find module, or any device/filesystem strings. All 5 error patterns covered by tests, negative cases confirmed. ✓
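The classification described here amounts to a case-sensitive substring check. A minimal sketch, assuming the five patterns listed in the review bar; the function name carries a Sketch suffix because the real implementation in hf-env.ts is not shown in this thread, and the sample messages are made up for illustration.

```typescript
// The five patterns named in the review bar. Matching is case-sensitive on purpose:
// Node.js errno constants are always uppercase.
const NETWORK_PATTERNS = ["fetch failed", "ECONNREFUSED", "ENOTFOUND", "ETIMEDOUT", "ECONNRESET"];

function isNetworkFetchErrorSketch(msg: string): boolean {
  return NETWORK_PATTERNS.some((pattern) => msg.includes(pattern));
}

// Illustrative messages only; real error strings will differ.
console.log(isNetworkFetchErrorSketch("TypeError: fetch failed")); // true
console.log(isNetworkFetchErrorSketch("getaddrinfo ENOTFOUND huggingface.co")); // true
console.log(isNetworkFetchErrorSketch("CUDA execution provider is not available")); // false
console.log(isNetworkFetchErrorSketch("Cannot find module 'onnxruntime-node'")); // false
```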

isHfCircuitOpenError() — unique tag 'hf-circuit-open'. Not reachable via arbitrary user-controlled strings at this code path. ✓

isHfDownloadFailure() — correctly combines both guards. ✓


Retry / timeout / circuit breaker assessment

withDownloadTimeout(): Uses manual fn().then(resolve, reject) + clearTimeout pattern. Clears timer on both success and failure paths. Late fn() resolution after timeout fires cannot call circuit.recordSuccess() — the outer Promise is already settled; late resolve(v) is a no-op. Timeout error message contains ETIMEDOUT so isNetworkFetchError classifies it. Underlying pipeline() download cannot be cancelled (Transformers.js lacks AbortController support) and continues consuming bandwidth after timeout reject — this is an accepted platform limitation, documented in code comment. ✓
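The timer pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the code from hf-env.ts: the function name and the exact error text are made up, and fn is assumed to return a promise rather than throw synchronously.

```typescript
// Races fn() against a timer. Whichever settles the outer promise first wins;
// any later resolve/reject call is a no-op because a promise settles only once.
function withTimeoutSketch<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      // The message contains ETIMEDOUT so a substring classifier treats it as a network error.
      reject(new Error(`ETIMEDOUT: download exceeded ${ms} ms`));
    }, ms);
    fn().then(
      (value) => {
        clearTimeout(timer); // success path: do not leave the timer pending
        resolve(value);
      },
      (err) => {
        clearTimeout(timer); // failure path: propagate the original error unchanged
        reject(err);
      },
    );
  });
}

withTimeoutSketch(() => Promise.resolve("model"), 1000).then((v) => console.log(v)); // model
```

Note that the timer only abandons the wait. As the paragraph above says, the underlying download keeps running because it cannot be cancelled.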

withHfDownloadRetry() — env override parsing: Reads process.env.HF_DOWNLOAD_TIMEOUT_MS and process.env.HF_MAX_ATTEMPTS on every call (correct — allows runtime configuration). Number.isFinite(v) && v > 0 guard rejects NaN, Infinity, 0, and negative values. Explicit options.* override env vars (via destructuring default), which override built-in constants — correct precedence. No upper bound clamping (see finding above). No tests for this path (see finding above).

Backoff: baseDelayMs * 2^attempt → 2s, 4s. Bounded. Tests use baseDelayMs: 0 for determinism. ✓

Non-network passthrough: isNetworkFetchError(lastError.message) gates retries. CUDA/ONNX/module errors propagate immediately without circuit.recordFailure(). ✓
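The retry gate described above can be illustrated with a small loop. This is a sketch only: retrySketch and its option names are hypothetical, the onRetry argument order is an assumption, and circuit-breaker recording and the per-attempt timeout are left out to keep it short.

```typescript
interface RetrySketchOptions {
  maxAttempts: number;
  baseDelayMs: number;
  isRetryable: (msg: string) => boolean; // e.g. a network-error classifier
  onRetry?: (attempt: number, max: number, err: Error) => void; // assumed argument order
}

async function retrySketch<T>(fn: () => Promise<T>, opts: RetrySketchOptions): Promise<T> {
  let lastError = new Error("no attempts made");
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      // Non-retryable errors (device, module, and similar) propagate immediately.
      if (!opts.isRetryable(lastError.message)) throw lastError;
      if (attempt < opts.maxAttempts - 1) {
        if (opts.onRetry) opts.onRetry(attempt + 1, opts.maxAttempts, lastError);
        // Exponential backoff: base, then 2 * base, and so on.
        const delay = opts.baseDelayMs * 2 ** attempt;
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // retries exhausted: surface the last network error
}
```

Passing baseDelayMs: 0, as the tests in this PR do, keeps such a loop deterministic without fake timers.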

Circuit opens mid-run: After recordFailure() on attempt N, circuit.isOpen() is checked; if threshold reached, CIRCUIT_OPEN_TAG error thrown immediately (no further sleep or attempts). ✓
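A compact sketch of the closed / open / half-open state machine discussed in this section, assuming the defaults from the PR description (threshold 3, reset after 60 s). The class and its injectable clock are illustrative and are not the HfDownloadCircuitBreaker source.

```typescript
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly resetMs: number;
  private readonly now: () => number;

  // The clock is injectable so tests can advance time without real timers.
  constructor(threshold = 3, resetMs = 60_000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.now = now;
  }

  state(): CircuitState {
    if (this.openedAt === null) return "closed";
    // Once the reset window has elapsed, one probe attempt is allowed.
    return this.now() - this.openedAt >= this.resetMs ? "half-open" : "open";
  }

  isOpen(): boolean {
    return this.state() === "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // closes the circuit, including from half-open
  }

  recordFailure(): void {
    if (this.state() === "half-open") {
      this.openedAt = this.now(); // probe failed: reopen for another full window
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

A caller would check isOpen() before each attempt and fail fast with a tagged error while it returns true.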

Worst-case delay: Default 3 × 5min + 2s + 4s ≈ 15 min 6 sec for stalled partial connections. For fast-fail cases (ENOTFOUND, ECONNREFUSED), actual total ≈ 6 seconds. HF_DOWNLOAD_TIMEOUT_MS=60000 HF_MAX_ATTEMPTS=2 reduces to ~2 min 2 sec. The env override is documented in the error path comment but not yet in user-facing docs or the error message itself.
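The arithmetic behind these figures can be checked with a few lines. The helper names are hypothetical, and the sketch assumes a base delay of 2000 ms with a zero-based exponent, which reproduces the 2 s and 4 s delays quoted above.

```typescript
// One sleep between consecutive attempts: base * 2^0, base * 2^1, ...
function backoffDelaysMs(maxAttempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxAttempts - 1; i++) delays.push(baseDelayMs * 2 ** i);
  return delays;
}

// Worst case: every attempt runs into the per-attempt timeout, plus all backoff sleeps.
function worstCaseMs(maxAttempts: number, timeoutMs: number, baseDelayMs: number): number {
  const sleeps = backoffDelaysMs(maxAttempts, baseDelayMs).reduce((a, b) => a + b, 0);
  return maxAttempts * timeoutMs + sleeps;
}

console.log(backoffDelaysMs(3, 2000)); // [ 2000, 4000 ]
console.log(worstCaseMs(3, 5 * 60 * 1000, 2000)); // 906000 ms = 15 min 6 s
console.log(worstCaseMs(2, 60000, 2000)); // 122000 ms = 2 min 2 s
```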

Singleton state: Module-level hfDownloadCircuit is acceptable — CLI (short-lived) only runs one invocation; MCP (long-lived) correctly prevents hammering. Tests use fresh HfDownloadCircuitBreaker instances; no singleton leakage. ✓


Core embedder assessment

  • pipeline() wrapped exactly once per device attempt in withHfDownloadRetry(). Old direct pipeline() call fully removed. ✓
  • applyHfEnvOverrides(env) called before device loop, once. ✓
  • finalConfig.modelId, device, dtype, progress_callback, session_options all unchanged. ✓
  • currentDevice assigned only after successful pipeline init (line 228). ✓
  • Device fallback loop preserved: CUDA/DML device error → loop continues to cpu. Network error → isHfDownloadFailure(errMsg) check → rethrow with actionable message, device loop exits. ✓
  • Guidance text includes both Unix and Windows command syntax. ✓
  • logger.warn({ attempt, max, err: err.message }, '...') — correct pino object+message shape. Only fires in NODE_ENV=development. ✓
  • HTTP embedding backend (isHttpMode()) check is before all device setup — unaffected. ✓
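The control flow in the bullets above, where device errors fall through to the next device but download failures stop the loop, might look roughly like this. It is a simplified sketch: loadWithFallback, the loader callback and the device names passed to it are placeholders, not the embedder's actual code.

```typescript
async function loadWithFallback<T>(
  devices: string[],
  load: (device: string) => Promise<T>,
  isDownloadFailure: (msg: string) => boolean,
): Promise<{ device: string; value: T }> {
  let lastError: Error | undefined;
  for (const device of devices) {
    try {
      const value = await load(device);
      return { device, value }; // the device is recorded only after a successful load
    } catch (err) {
      const e = err instanceof Error ? err : new Error(String(err));
      if (isDownloadFailure(e.message)) {
        // A blocked network fails on every device, so rethrow instead of falling back.
        throw new Error(`Failed to download embedding model: ${e.message}`);
      }
      lastError = e; // device-level failure: try the next device
    }
  }
  throw lastError ?? new Error("no devices to try");
}
```

The wrapper message keeps the original network error text inside it, which is why a substring classifier still recognises the rethrown error further up the stack.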

MCP embedder assessment

  • Same withHfDownloadRetry() wrapper + shared hfDownloadCircuit singleton. ✓
  • Same applyHfEnvOverrides(env) + isHfDownloadFailure(errMsg) + early rethrow pattern. Behavioral parity confirmed. ✓
  • silenceStdout() + process.stderr.write = (() => true) as any wrap the entire withHfDownloadRetry() call; finally block correctly restores both. During retry backoff sleeps, stdout/stderr remain suppressed — intentional for MCP stdio protocol safety. ✓
  • Error from MCP embedder propagates to embedQuery() callers with actionable HF guidance in the message — useful for MCP client logs. ✓
  • Pino logger writes to fd 2 via SonicBoom. Since process.stderr.write is replaced during model load, pino records emitted during retries are silenced. After restoreStdout(), normal behavior resumes. ✓
  • No onRetry callback configured in MCP embedder — no retry-log noise on stderr. ✓
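The suppress-and-restore pattern in the bullets above can be sketched against a generic write target. WriteTarget and withSilencedWrite are hypothetical stand-ins; the real embedder replaces process.stderr.write and calls silenceStdout(), neither of which is reproduced here.

```typescript
// Minimal stand-in for a stream whose write method can be swapped out.
interface WriteTarget {
  write: (chunk: string) => boolean;
}

async function withSilencedWrite<T>(target: WriteTarget, fn: () => Promise<T>): Promise<T> {
  const original = target.write;
  target.write = () => true; // swallow output for the whole call, including backoff sleeps
  try {
    return await fn();
  } finally {
    target.write = original; // restored on both success and failure
  }
}
```

Putting the restore in finally is what guarantees a failed download cannot leave the stream permanently muted.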

CLI error UX assessment

The isHfDownloadFailure branch is correctly placed before writeFatalToStderr at line 583, matching the RegistryNameCollisionError early-return pattern. Users now see exactly one focused message block with no preceding stack trace. ✓

Clean output:

  The embedding model could not be downloaded.
  huggingface.co may be unreachable from your network
  (e.g. behind a corporate proxy or a regional firewall).
  Suggestions:
    1. Set HF_ENDPOINT to a mirror and retry:
         HF_ENDPOINT=https://hf-mirror.com npx gitnexus analyze --embeddings
         (Windows: set HF_ENDPOINT=https://hf-mirror.com && npx gitnexus analyze --embeddings)
    2. Check your proxy / VPN settings.
    3. Once downloaded the model is cached — future runs work offline.

recoveryHint: 'hf-endpoint-unreachable' stable. ✓
Windows command syntax present. ✓
process.exitCode = 1 set once at the right scope. ✓
Unexpected errors still reach writeFatalToStderr + relevant cliError blocks. ✓


Test assessment

38 tests total across 6 describe blocks.

Scenario Status
All 5 network error patterns ✅
Non-network negatives (CUDA, empty, module-not-found) ✅
withDownloadTimeout success ✅
withDownloadTimeout timeout with fake timers ✅ (fixed from prior review)
Timeout propagates non-timeout errors ✅
Retry success after transient failure ✅
Retry exhaustion throws last network error ✅
Non-network error no retry ✅
Circuit opens after threshold ✅
Circuit fails fast while open ✅
Circuit half-open after reset ✅
Half-open → failure → reopens ✅ (fixed from prior review)
Half-open → success closes circuit ✅ (fixed from prior review)
onRetry callback args ✅
Circuit state reset between tests ✅ (fresh instances)
HF_DOWNLOAD_TIMEOUT_MS env override valid ❌ missing
HF_DOWNLOAD_TIMEOUT_MS env override invalid ❌ missing
HF_MAX_ATTEMPTS env override valid/invalid ❌ missing

Fake timers used correctly via vi.useFakeTimers() + try/finally in all time-sensitive tests. Backoff tests use baseDelayMs: 0. No singleton leakage between tests. ✓

Missing integration tests (desirable follow-ups, not blockers): mocked pipeline() to verify device fallback preserved for device errors; CLI analyze prints one clean cliError block; MCP embedder uses same helper.


Performance / UX assessment

Worst case (stalled partial connection): 3 × 5min + 6s ≈ 15 min 6 sec. Mitigated by HF_DOWNLOAD_TIMEOUT_MS / HF_MAX_ATTEMPTS overrides (documented in hf-env.ts comment, not yet in the error message itself).

Fast-fail cases (ENOTFOUND, ECONNREFUSED): ≈ 6 seconds total — acceptable for all common blocked-network scenarios.

Cached model: applyHfEnvOverrides sets env.cacheDir. If model is already cached, Transformers.js loads from disk with no network access. No retry penalty. ✓

Retry noise: onRetry logging only in NODE_ENV=development. Production users see no retry attempt logs. ✓


Hidden Unicode / hygiene assessment

grep -P "[\x{202A}-\x{202E}\x{2066}-\x{2069}]" across all 5 changed files: zero results. No bidirectional or directional Unicode controls anywhere.
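The same scan can be expressed in TypeScript, for example as a pre-commit check. A minimal sketch using the code point ranges from the grep above; findBidiControlLines is a hypothetical helper name.

```typescript
// Same ranges as the grep: U+202A to U+202E and U+2066 to U+2069.
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/;

// Returns the 1-based line numbers that contain a bidi control character.
function findBidiControlLines(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (BIDI_CONTROLS.test(line)) hits.push(index + 1);
  });
  return hits;
}

console.log(findBidiControlLines("const a = 1;\nconst b = '\u202E';")); // [ 2 ]
console.log(findBidiControlLines("// em-dash \u2014 is fine")); // []
```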

Non-ASCII characters found: em-dashes (U+2014) in JSDoc comments and the ETIMEDOUT message string. All are human-readable typographic characters in comments or logged strings; none appear in conditions, identifiers, or control-flow expressions. GitHub's hidden-Unicode warning refers to these benign characters. Not a blocker.

No generated files, no package.json changes, no package-lock churn. ✓


Elegance / maintainability assessment

hf-env.ts is cohesive: all HF download resilience helpers in one file, exported with @internal JSDoc, well-documented constants, clean type definitions. Retry/circuit logic not duplicated between core and MCP embedder — shared via import. withHfDownloadRetry API is small and injectable (circuit option for test isolation). Comments in embedder catch blocks correctly explain WHY device fallback is skipped — non-obvious reasoning, well-documented. Env override parsing is simple and safe. No unrelated embedding behavior changed. No over-broad retry blanket — non-network errors propagate immediately without retry.


Final verdict

production-ready with minor follow-ups

All prior review findings have been correctly addressed: the HF error check is before writeFatalToStderr (clean single-message UX), HF_DOWNLOAD_TIMEOUT_MS/HF_MAX_ATTEMPTS env overrides are exposed and validated, fake timers are used correctly in the timeout test, half-open → failure and half-open → success circuit tests are present, and Windows command syntax appears in all three output locations. The core correctness properties all hold: network errors are classified correctly without misclassifying device failures, pipeline() is wrapped exactly once per attempt in both embedders, circuit breaker state machine is correct, pino/cliError merge is clean.

The two remaining findings — missing unit tests for the env override parsing code path (medium, DoD gap) and no upper bound clamping on env override values (minor) — do not affect production behavior for any non-explicitly-misconfigured user. The env override logic itself is correct; only test coverage is absent. These are appropriate follow-ups before a high-traffic release but do not block merge.

@magyargergo

Collaborator

@copilot Following DoD.md, you must fix these findings!

Copilot AI commented May 8, 2026

Contributor Author

Claude finished @magyargergo's task in 6m 8s —— View job

PR #1419 Review — Adversarial Production Readines...

Done in commits f3306f3 and e6de518. Both findings addressed:

[medium/DoD] Missing env override tests — Added 11 new tests in a describe('withHfDownloadRetry env overrides', ...) block (hf-env.test.ts): HF_MAX_ATTEMPTS=1/2 gives exactly N attempts, =abc/0/-1 falls back to built-in default (3), =9999 is clamped to HF_MAX_ATTEMPTS_CAP (10), =2.9 is floored to 2; HF_DOWNLOAD_TIMEOUT_MS valid value used as per-attempt timeout, =-1 falls back to default, large value clamped to HF_MAX_TIMEOUT_MS; explicit option overrides env var. Total: 49 tests.

[minor] No upper bound clamping — Added HF_MAX_TIMEOUT_MS = 30 * 60 * 1_000 (30 min) and HF_MAX_ATTEMPTS_CAP = 10 constants. resolvedTimeout is now Math.min(envTimeout, HF_MAX_TIMEOUT_MS); resolvedMaxAttempts is Math.min(Math.floor(envMaxAttempts), HF_MAX_ATTEMPTS_CAP). CodeQL: 0 alerts.

Comment thread gitnexus/test/unit/hf-env.test.ts Fixed
@magyargergo

Collaborator

@copilot I don't like the fact that we have to cast the pipeline with any, which is a clear violation of the rule against implicit any casting, and we don't want that. Let's fix this here so the code becomes much more reliable.

…sInfo; remove unused HF_DOWNLOAD_TIMEOUT_MS import

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/43e60ccd-6f01-4cec-8ee3-c22e8b000efe

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
…gress callback

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/43e60ccd-6f01-4cec-8ee3-c22e8b000efe

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>

Copilot AI commented May 8, 2026

Contributor Author

@copilot I don't like the fact that we have to cast the pipeline with any which is a clear violation of the implicit any casting and we don't want that let's fix this here so the code becomes much ...

Done in commits f5b62e4 and 1911b1d. All any casts related to pipeline are removed:

(pipeline as any)(...) → pipeline(...) in both core/embeddings/embedder.ts and mcp/core/embedder.ts. The function is already typed in @huggingface/transformers as pipeline<T extends PipelineType>(task, model, options): Promise<AllTasks[T]> and requires no cast.

(data: any) in the progress callback → (data: ProgressInfo) with proper discriminated-union narrowing:

  • status is mapped from 'progress_total' → 'progress' (the TotalProgressInfo aggregate event has no ModelProgress equivalent), with a ?? 'progress' fallback
  • Optional fields (file, progress, loaded, total) are accessed with 'field' in data guards since they only exist on some union members

CodeQL alert (unused import): HF_DOWNLOAD_TIMEOUT_MS removed from hf-env.test.ts imports. The constant was never referenced in any assertion (only the string 'HF_DOWNLOAD_TIMEOUT_MS' as an env key).
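The narrowing described above can be sketched roughly as follows. The union and the ModelProgressSketch shape are simplified, hypothetical stand-ins defined locally so the example is self-contained; they are not the real ProgressInfo type exported by @huggingface/transformers, whose members may differ.

```typescript
// Hypothetical, simplified stand-in for the library's progress union.
type ProgressInfoSketch =
  | { status: 'initiate' | 'download' | 'done'; name: string; file: string }
  | { status: 'progress'; name: string; file: string; progress: number; loaded: number; total: number }
  | { status: 'progress_total'; name: string; progress: number; loaded: number; total: number }
  | { status: 'ready'; task: string; model: string };

// Hypothetical shape of what the embedder reports to its own callers.
interface ModelProgressSketch {
  status: string;
  file?: string;
  progress?: number;
  loaded?: number;
  total?: number;
}

function toModelProgress(data: ProgressInfoSketch): ModelProgressSketch {
  return {
    // The aggregate event has no direct equivalent, so it is reported as 'progress'.
    status: data.status === 'progress_total' ? 'progress' : data.status,
    // 'field' in data narrows the union to the members that carry that field.
    file: 'file' in data ? data.file : undefined,
    progress: 'progress' in data ? data.progress : undefined,
    loaded: 'loaded' in data ? data.loaded : undefined,
    total: 'total' in data ? data.total : undefined,
  };
}
```

Because every access goes through an in guard, the compiler checks each field against the union member that actually declares it, which is the property an any cast gives up.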

@github-actions

github-actions Bot commented May 8, 2026

Contributor

Claude finished @magyargergo's task in 4m 40s —— View job


PR #1419 Review — Adversarial Production Readiness

Task List

  • Post initial comment
  • Read repo context (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md)
  • Inspect changed files (hf-env.ts, embedder.ts × 2, analyze.ts, hf-env.test.ts)
  • Inspect related files (config.ts, types.ts, cli-message.ts, logger.ts, package.json)
  • Check git log / head SHA / CI status
  • Hidden Unicode / bidi scan
  • Assess merge-conflict artifacts
  • Evaluate error classification correctness
  • Evaluate retry/timeout/circuit-breaker logic
  • Evaluate core + MCP embedder integration
  • Evaluate CLI UX
  • Evaluate test determinism + coverage
  • Evaluate env override parsing safety
  • Write final review

Review bar for this PR

  • isNetworkFetchError() must catch fetch failed / ECONNREFUSED / ENOTFOUND / ETIMEDOUT / ECONNRESET without misclassifying CUDA / WASM / ONNX device failures.
  • withHfDownloadRetry() must retry only network errors; non-network errors propagate immediately without retry.
  • Circuit breaker must open after threshold, fail fast while open, recover via half-open → closed; half-open failure must reopen.
  • Retry backoff must be bounded; tests must be deterministic using fake timers.
  • HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS env overrides must validate safely with upper bound clamping.
  • Network failure must short-circuit device fallback with actionable HF_ENDPOINT guidance; cliError() before writeFatalToStderr; no raw stack trace.
  • pino logger used consistently; no console.* regression; no pipeline as any cast.
  • Hidden Unicode/bidi controls absent from executable code.

Current PR state

| Field | Value |
| --- | --- |
| Head SHA | 1911b1d675bd430ae30522acbd7b4294004ae84b |
| Status | Open (non-draft) |
| Commits | 14 |
| Changed files | 5 (hf-env.ts, embedder.ts, mcp/core/embedder.ts, analyze.ts, hf-env.test.ts) |
| CI | ✅ All checks passed (typecheck ✓, unit tests 8213/8213 ✓, e2e ✓) per latest CI report |
| Typecheck | ✅ tsc --noEmit success |
| hf-env.test.ts | ✅ 49 tests across 6 describe blocks |
| Package state | No new dependencies, no package-lock churn |
| Hidden Unicode | ✅ Zero bidi/directional controls (U+202A–202E, U+2066–2069). Non-ASCII are em-dashes (—) and arrows (→) in JSDoc comments, which are benign. |
| Vercel | Irrelevant to this embedding/CLI/MCP reliability review |
| Prior findings | All findings from prior two review cycles confirmed fixed on this HEAD |

Branch hygiene / merge-conflict assessment

Clean. Two merge-forward commits (4f96ff8, 3c160c8) resolved conflicts from the pino/cliError() merge correctly.

  • core/embeddings/embedder.ts — single clean import block at line 30: { applyHfEnvOverrides, isHfDownloadFailure, withHfDownloadRetry }. No duplicate imports. No console.* usage. pipeline() wrapped exactly once in withHfDownloadRetry() per device attempt. applyHfEnvOverrides(env) called before device loop. ProgressInfo typed correctly; 'progress_total' → 'progress' mapping with ?? 'progress' fallback. No pipeline as any cast. ✓
  • mcp/core/embedder.ts — single import block lines 9–19. pipeline() wrapped exactly once. silenceStdout() + stderr suppression in the correct inner try/finally around withHfDownloadRetry. No duplicate applyHfEnvOverrides calls. One process.stderr.write = (() => true) as any remains — this is the pre-existing MCP stdio safety cast (arity mismatch with process.stderr.write's type signature makes the cast structurally necessary; it is not a pipeline cast). ✓
  • analyze.ts — single import of isHfDownloadFailure at line 30. process.exitCode = 1 set correctly. No console.error mixed with pino. No dangling braces or duplicate });. ✓

No any-cast escape hatches introduced for type-checking workarounds. No stale conflict comments.


Understanding of the change

TypeError: fetch failed was poor UX because it gave no indication that HF_ENDPOINT could reroute the download, and continued into a device-fallback loop (CUDA → CPU → WASM) that was meaningless for a network failure. HF_ENDPOINT is the right hint because applyHfEnvOverrides() already maps it to env.remoteHost, which Transformers.js reads. Network errors must skip device fallback because the model cannot be fetched on any device if the network is blocked. Retry/timeout/circuit-breaker helps transient jitter (packet loss, CDN edge resets) while bounding worst-case wait and preventing MCP servers from hammering an unreachable endpoint.


Findings

[minor] Worst-case failure delay not mentioned in user-facing error output

  • Category: CLI UX / Discoverability
  • Files: gitnexus/src/core/embeddings/hf-env.ts:270–276, gitnexus/src/cli/analyze.ts:584–597
  • Issue: HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS are now documented in code comments and correctly tested, but they are not mentioned in the error message shown to users. A user behind a slow partial-connection who waits 15 minutes before seeing the error has no in-band hint that they could set HF_DOWNLOAD_TIMEOUT_MS=60000 to bail out faster. The cliError() message covers HF_ENDPOINT, proxy, and cache — but not the timeout override.
  • Why it matters: Users most affected by the 15-minute wait (slow or stalled proxies) are precisely the users who would benefit from knowing the override exists.
  • Recommended fix: Append one line to the cliError() suggestions: 4. Reduce per-attempt wait: HF_DOWNLOAD_TIMEOUT_MS=60000 npx gitnexus analyze --embeddings. Low-priority; the env vars are documented and functional — this is discoverability polish. Fix this →
  • Blocks merge: no

[cosmetic] Belt-and-suspenders msg.includes('Failed to download embedding model') in analyze.ts

  • Category: Code clarity
  • Files: gitnexus/src/cli/analyze.ts:583
  • Issue: isHfDownloadFailure(msg) || msg.includes('Failed to download embedding model') — the second branch can never be the only one that matches, because the embedder always embeds the original network-error string (which already matches isNetworkFetchError) in the wrapper message. No behavioral impact; it serves as a safety net for future callers.
  • Blocks merge: no

HF endpoint / error classification assessment

applyHfEnvOverrides() — correct. HF_ENDPOINT → env.remoteHost with trailing slash normalization. Whitespace-only values silently treated as unset via .trim() guard. env.cacheDir defaults to ~/.cache/huggingface. No credentials printed in the mirror hint (hardcoded URL). Called before pipeline() in both embedder entry points. ✓
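A minimal sketch of the endpoint mapping described above. The function name, the HfEnvLike interface, and the choice to normalize to exactly one trailing slash are assumptions made for illustration; the review only says that the trailing slash is normalized and that whitespace-only values count as unset.

```typescript
// Hypothetical stand-in for the Transformers.js env object.
interface HfEnvLike {
  remoteHost?: string;
}

function applyEndpointOverride(env: HfEnvLike, rawEndpoint: string | undefined): void {
  // Whitespace-only or missing values are treated as "not set".
  const endpoint = rawEndpoint?.trim();
  if (!endpoint) return;
  // Assumption: normalize to exactly one trailing slash.
  env.remoteHost = endpoint.replace(/\/+$/, '') + '/';
}
```

In this sketch the caller would pass process.env.HF_ENDPOINT as rawEndpoint before any download starts, mirroring the "called before pipeline()" ordering noted above.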

Credential risk: process.env.HF_ENDPOINT is interpolated into the embedder's guidance hint when set (The configured endpoint (${process.env.HF_ENDPOINT}) may be unreachable.). Corporate endpoints with auth tokens in the URL would be exposed in error output. Accepted risk — consistent with prior review assessments.

isNetworkFetchError() — correct. Case-sensitive substring matches on Node.js errno constants (always uppercase). Does NOT match CUDA, DirectML, WASM, ONNX, Cannot find module, or any device/filesystem pattern. All 5 patterns covered by positive tests; CUDA/empty/module-not-found negative cases confirmed. ✓
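The classifier behavior described above amounts to a case-sensitive substring check. The following is a sketch under that reading, not the code from hf-env.ts; the name isNetworkFetchErrorSketch is hypothetical.

```typescript
// The five patterns listed in the PR description.
const NETWORK_ERROR_PATTERNS = [
  'fetch failed',
  'ECONNREFUSED',
  'ENOTFOUND',
  'ETIMEDOUT',
  'ECONNRESET',
] as const;

function isNetworkFetchErrorSketch(msg: string): boolean {
  // Case-sensitive on purpose: Node.js errno codes are always uppercase,
  // so device errors that merely mention "timeout" do not match.
  return NETWORK_ERROR_PATTERNS.some((pattern) => msg.includes(pattern));
}
```

Keeping the pattern list narrow is what prevents CUDA, WASM, or module-resolution failures from being retried as if they were network problems.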

isHfCircuitOpenError() — unique tag 'hf-circuit-open'. Not reachable via arbitrary user-controlled strings at this code path. ✓

isHfDownloadFailure() — correctly combines both guards. ✓


Retry / timeout / circuit breaker assessment

withDownloadTimeout(): Uses manual fn().then(resolve, reject) + clearTimeout pattern — correctly clears timer on both success and failure paths. Timeout error message contains ETIMEDOUT so isNetworkFetchError classifies it. Late fn() resolution after timeout fires cannot call circuit.recordSuccess() — outer Promise already settled, late resolve(v) is a no-op. Underlying pipeline() download cannot be cancelled (Transformers.js lacks AbortController support) and continues in the background — accepted platform limitation, documented in code comment. ✓
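The timer-clearing pattern described above might look roughly like this. It is a sketch with a hypothetical name and error text; the only detail taken from the review is that the timeout message contains ETIMEDOUT and that the timer is cleared on both settle paths.

```typescript
function withDownloadTimeoutSketch<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      // The message contains ETIMEDOUT so a network classifier can recognize it.
      reject(new Error(`ETIMEDOUT: download did not finish within ${ms} ms`));
    }, ms);

    let pending: Promise<T>;
    try {
      pending = fn();
    } catch (err) {
      // fn threw synchronously: clear the timer so it cannot linger.
      clearTimeout(timer);
      reject(err);
      return;
    }

    pending.then(
      (value) => {
        clearTimeout(timer);
        resolve(value); // No-op if the timeout already rejected.
      },
      (err) => {
        clearTimeout(timer);
        reject(err); // Non-timeout errors pass through unchanged.
      },
    );
  });
}
```

Note that the wrapped work is not cancelled when the timer fires; the outer promise simply settles first, which matches the platform limitation noted above.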

withHfDownloadRetry() — env override parsing:

  • Reads process.env.HF_DOWNLOAD_TIMEOUT_MS and process.env.HF_MAX_ATTEMPTS on every call (correct — allows runtime configuration).
  • Number.isFinite(v) && v > 0 guard rejects NaN, Infinity, 0, and negative values → falls back to built-in defaults.
  • Math.min(envTimeout, HF_MAX_TIMEOUT_MS) clamps the timeout to 30 minutes. ✓
  • Math.min(Math.floor(envMaxAttempts), HF_MAX_ATTEMPTS_CAP) floors fractional values and clamps to 10. ✓
  • Explicit options.* override env vars via destructuring defaults → override built-in constants. Correct precedence. ✓
  • Full 11-test suite validates all edge cases: valid values, abc/0/-1 fallback, 9999 clamping, 2.9 flooring, env vs explicit precedence, clamped timeout via fake timers.
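The validation, clamping, and precedence rules listed above can be sketched as a pure function. The constant values come from the PR discussion; the function and interface names are hypothetical, the environment is passed in as a plain object to keep the example self-contained, and the Math.max(1, ...) guard for fractional values below 1 is an extra assumption of this sketch that the review does not confirm.

```typescript
const HF_MAX_TIMEOUT_MS = 30 * 60 * 1_000; // 30 min cap
const HF_MAX_ATTEMPTS_CAP = 10;
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1_000; // 5 min default
const DEFAULT_MAX_ATTEMPTS = 3;

interface RetryEnv {
  HF_DOWNLOAD_TIMEOUT_MS?: string;
  HF_MAX_ATTEMPTS?: string;
}

interface RetryOptions {
  timeoutMs?: number;
  maxAttempts?: number;
}

// Rejects NaN, Infinity, 0, and negatives; undefined means "use the default".
function parsePositive(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const value = Number(raw);
  return Number.isFinite(value) && value > 0 ? value : undefined;
}

function resolveRetryConfig(env: RetryEnv, options: RetryOptions = {}) {
  const envTimeout = parsePositive(env.HF_DOWNLOAD_TIMEOUT_MS);
  const envMaxAttempts = parsePositive(env.HF_MAX_ATTEMPTS);

  const resolvedTimeout =
    envTimeout === undefined ? DEFAULT_TIMEOUT_MS : Math.min(envTimeout, HF_MAX_TIMEOUT_MS);
  const resolvedMaxAttempts =
    envMaxAttempts === undefined
      ? DEFAULT_MAX_ATTEMPTS
      : Math.max(1, Math.min(Math.floor(envMaxAttempts), HF_MAX_ATTEMPTS_CAP));

  // Explicit options win over env values, which win over built-in defaults.
  const { timeoutMs = resolvedTimeout, maxAttempts = resolvedMaxAttempts } = options;
  return { timeoutMs, maxAttempts };
}
```

Destructuring defaults give the precedence for free: a caller-supplied option is used as-is, and the env-derived value only fills in when the option is undefined.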

Backoff: baseDelayMs * 2^attempt → 2s, 4s. Bounded. Tests use baseDelayMs: 0 for determinism. ✓

Non-network passthrough: isNetworkFetchError(lastError.message) gates retries. CUDA/ONNX/module errors propagate immediately without circuit.recordFailure(). ✓
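A reduced sketch of the retry gate and backoff described above, leaving out the per-attempt timeout and the circuit breaker. The names retryOnNetworkError and isNetworkError are hypothetical; the delay formula baseDelayMs * 2^attempt and the defaults (3 attempts, 2 s base) are taken from the discussion.

```typescript
const isNetworkError = (msg: string): boolean =>
  ['fetch failed', 'ECONNREFUSED', 'ENOTFOUND', 'ETIMEDOUT', 'ECONNRESET'].some((p) =>
    msg.includes(p),
  );

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

async function retryOnNetworkError<T>(
  fn: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 2_000 }: { maxAttempts?: number; baseDelayMs?: number } = {},
): Promise<T> {
  let lastError = new Error('no attempts made');
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      // Non-network errors (device, module, ...) propagate immediately.
      if (!isNetworkError(lastError.message)) throw lastError;
      // Exponential backoff between attempts: 2 s, then 4 s with the defaults.
      if (attempt < maxAttempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Passing baseDelayMs: 0, as the tests in the PR are said to do, removes the waits without changing the control flow.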

Circuit opens mid-run: After recordFailure() on attempt N, circuit.isOpen() checked; if threshold reached, CIRCUIT_OPEN_TAG error thrown immediately (no further sleep or attempts). ✓

Singleton state: Module-level hfDownloadCircuit acceptable — CLI (short-lived) one invocation only; MCP (long-lived) correctly prevents hammering. Tests use fresh HfDownloadCircuitBreaker instances; no singleton leakage. ✓
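The closed/open/half-open behavior discussed above can be illustrated with a small state machine. This is a sketch, not the HfDownloadCircuitBreaker class from the PR: the class name, the constructor shape, and the injectable now clock are assumptions, while the threshold of 3 and the 60 s reset are the defaults quoted in the PR description.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 3,
    private readonly resetMs = 60_000,
    // Injectable clock so tests need no real waiting (an assumption of this sketch).
    private readonly now: () => number = Date.now,
  ) {}

  state(): CircuitState {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.resetMs ? 'half-open' : 'open';
  }

  isOpen(): boolean {
    return this.state() === 'open';
  }

  recordSuccess(): void {
    // Any success, including a half-open probe, closes the circuit.
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    if (this.state() === 'half-open') {
      // A failed probe reopens the circuit and restarts the reset window.
      this.openedAt = this.now();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

Creating a fresh instance per test, as the review notes the suite does, avoids state leaking through a shared module-level singleton.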

Worst-case delay: Default 3 × 5min + 2s + 4s ≈ 15 min 6 sec for stalled partial connections. For fast-fail cases (ENOTFOUND, ECONNREFUSED), actual total ≈ 6 seconds. HF_DOWNLOAD_TIMEOUT_MS=60000 HF_MAX_ATTEMPTS=2 reduces to ~2 min. Overrides functional and tested; discoverability noted as minor finding.
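The worst-case figures above follow from summing the per-attempt timeouts and the backoff sleeps between attempts. A small helper (hypothetical name) makes the arithmetic checkable:

```typescript
// Worst case: every attempt runs into its timeout, and a backoff sleep of
// baseDelayMs * 2^attempt separates consecutive attempts.
function worstCaseDelayMs(maxAttempts: number, timeoutMs: number, baseDelayMs: number): number {
  let total = maxAttempts * timeoutMs;
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    total += baseDelayMs * 2 ** attempt;
  }
  return total;
}

// Defaults: 3 x 300 000 ms + 2 000 ms + 4 000 ms = 906 000 ms (15 min 6 s).
const defaultWorstCase = worstCaseDelayMs(3, 5 * 60 * 1_000, 2_000);

// HF_DOWNLOAD_TIMEOUT_MS=60000 HF_MAX_ATTEMPTS=2: 2 x 60 000 ms + 2 000 ms = 122 000 ms (about 2 min).
const tunedWorstCase = worstCaseDelayMs(2, 60_000, 2_000);
```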


Core embedder assessment

  • pipeline() wrapped exactly once per device attempt in withHfDownloadRetry(). Old direct call fully removed. ✓
  • applyHfEnvOverrides(env) called before device loop, once. ✓
  • ProgressInfo type imported from @huggingface/transformers; no data: any cast. 'progress_total' → 'progress' mapping with ?? 'progress' defensive fallback correct. ✓
  • finalConfig.modelId, device, dtype, progress_callback, session_options all unchanged. ✓
  • currentDevice assigned only after successful pipeline init (line 238). ✓
  • Device fallback loop preserved: CUDA/DML error → loop continues to cpu. Network/circuit error → isHfDownloadFailure() → rethrow with actionable message + Windows syntax, device loop exits. ✓
  • logger.warn({ attempt, max, err: err.message }, '...') — correct pino object+message shape. Only fires in NODE_ENV=development. ✓
  • HTTP embedding backend (isHttpMode()) checked before all device setup — unaffected. ✓
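The device-fallback behavior summarized in the list above can be sketched as a loop that keeps going on device errors but stops at the first download failure. All names here are hypothetical, and the loader and classifier are injected so the example stays self-contained; the wrapper message text is the one quoted from analyze.ts in the findings.

```typescript
async function loadWithDeviceFallback<T>(
  devices: readonly string[],
  load: (device: string) => Promise<T>,
  isDownloadFailure: (msg: string) => boolean,
): Promise<{ device: string; model: T }> {
  let lastError: Error | undefined;
  for (const device of devices) {
    try {
      const model = await load(device);
      // The active device is only recorded after a successful load.
      return { device, model };
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      if (isDownloadFailure(error.message)) {
        // No device can succeed if the model cannot be fetched: exit the loop.
        throw new Error(`Failed to download embedding model: ${error.message}`);
      }
      lastError = error; // Device-specific failure: try the next device.
    }
  }
  throw lastError ?? new Error('no devices to try');
}
```

Keeping the original error text inside the wrapper message is what lets a later classifier still recognize the failure as network-related.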

MCP embedder assessment

  • Same withHfDownloadRetry() wrapper + shared hfDownloadCircuit singleton. ✓
  • Same applyHfEnvOverrides(env) + isHfDownloadFailure(errMsg) + early rethrow pattern. Behavioral parity with core embedder confirmed. ✓
  • silenceStdout() + process.stderr.write = (() => true) as any wrap the entire withHfDownloadRetry() call; finally block correctly restores both. The as any is structurally necessary — process.stderr.write's TypeScript type requires a callback-aware signature but the suppressor needs no args; this is a pre-existing MCP io-safety pattern, not a new type escape. ✓
  • During retry backoff sleeps, stdout/stderr remain suppressed — intentional for MCP stdio protocol safety. Acceptable and documented. ✓
  • Error propagates to embedQuery() callers with actionable HF guidance — useful for MCP client logs. ✓
  • Pino logger writes to fd 2 via SonicBoom; since process.stderr.write is replaced during model load, pino records during retries are silenced; restored after restoreStdout(). ✓
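The suppress-then-restore pattern described in the list above can be shown generically. This sketch uses a hypothetical WritableLike interface instead of process.stderr so that it runs without Node typings; it illustrates only the try/finally shape, not the MCP embedder's actual code.

```typescript
// Minimal stand-in for a stream with a replaceable write method.
interface WritableLike {
  write: (chunk: string) => boolean;
}

async function withSilencedWrite<T>(stream: WritableLike, fn: () => Promise<T>): Promise<T> {
  const original = stream.write;
  // Swallow all output while fn runs, including across any retry sleeps inside it.
  stream.write = () => true;
  try {
    return await fn();
  } finally {
    // Restored on success and on failure, so later logging works again.
    stream.write = original;
  }
}
```

Because the stand-in write accepts the same arguments as the original, no cast is needed here; the cast mentioned above stems from the overloaded signature of the real process.stderr.write.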

CLI error UX assessment

isHfDownloadFailure check at analyze.ts:583 is before writeFatalToStderr at line 606, with early return — matching the RegistryNameCollisionError pattern. Users see exactly one clean remediation block, no stack trace. ✓

  The embedding model could not be downloaded.
  huggingface.co may be unreachable from your network
  (e.g. behind a corporate proxy or a regional firewall).
  Suggestions:
    1. Set HF_ENDPOINT to a mirror and retry:
         HF_ENDPOINT=https://hf-mirror.com npx gitnexus analyze --embeddings
         (Windows: set HF_ENDPOINT=https://hf-mirror.com && npx gitnexus analyze --embeddings)
    2. Check your proxy / VPN settings.
    3. Once downloaded the model is cached — future runs work offline.

recoveryHint: 'hf-endpoint-unreachable' stable. ✓ Windows command syntax present. ✓ process.exitCode = 1 set once at the correct scope. ✓ Unexpected errors still reach writeFatalToStderr + relevant cliError blocks. ✓


Test assessment

49 tests total across 6 describe blocks.

| Scenario | Status |
| --- | --- |
| All 5 network error patterns | ✅ |
| Non-network negatives (CUDA, empty, module-not-found) | ✅ |
| withDownloadTimeout success | ✅ |
| withDownloadTimeout timeout with fake timers | ✅ |
| Timeout propagates non-timeout errors | ✅ |
| Retry success after transient failure | ✅ |
| Retry exhaustion throws last network error | ✅ |
| Non-network error no retry | ✅ |
| Circuit opens after threshold | ✅ |
| Circuit fails fast while open | ✅ |
| Circuit half-open after reset | ✅ |
| Half-open → failure → reopens | ✅ |
| Half-open → success closes circuit | ✅ |
| onRetry callback args | ✅ |
| Circuit state reset between tests | ✅ (fresh instances) |
| HF_MAX_ATTEMPTS=1/2 → exact N attempts | ✅ |
| HF_MAX_ATTEMPTS=abc/0/-1 → default fallback | ✅ |
| HF_MAX_ATTEMPTS=9999 → clamped to cap | ✅ |
| HF_MAX_ATTEMPTS=2.9 → floored to 2 | ✅ |
| HF_DOWNLOAD_TIMEOUT_MS valid → used as timeout | ✅ (fake timers) |
| HF_DOWNLOAD_TIMEOUT_MS=-1 → default fallback | ✅ |
| HF_DOWNLOAD_TIMEOUT_MS large → clamped | ✅ (fake timers) |
| Explicit options override env vars | ✅ |

Fake timers used correctly via vi.useFakeTimers() + try/finally in all time-sensitive tests. Backoff tests use baseDelayMs: 0. No singleton leakage between tests. ✓


Performance / UX assessment

Worst case (stalled partial connection): 3 × 5min + 6s ≈ 15 min 6 sec. Mitigated by HF_DOWNLOAD_TIMEOUT_MS / HF_MAX_ATTEMPTS overrides (functional, tested, clamped). Discoverability improvement noted as minor follow-up.

Fast-fail (ENOTFOUND, ECONNREFUSED): ≈ 6 seconds total — acceptable for all common blocked-network scenarios.

Cached model: applyHfEnvOverrides sets env.cacheDir. Model already cached → Transformers.js loads from disk with no network access. No retry penalty. ✓

Retry noise: onRetry logging only in NODE_ENV=development. Production users see no retry attempt logs. ✓


Hidden Unicode / hygiene assessment

grep -nP "[\x{202A}-\x{202E}\x{2066}-\x{2069}]" across all 5 changed files: zero results. No bidirectional or directional Unicode controls anywhere.

Non-ASCII characters: em-dashes (—, U+2014) and arrows (→) in JSDoc comments and ETIMEDOUT message string. All human-readable typographic characters in comments/logged strings, with none in conditions, identifiers, or control-flow. GitHub's hidden-Unicode warning refers to these benign characters. Not a blocker.

No generated files, no package.json changes, no package-lock churn. ✓


Elegance / maintainability assessment

hf-env.ts is cohesive: all HF download resilience helpers in one file, exported with @internal JSDoc, well-documented constants, clean type definitions. Retry/circuit logic not duplicated between core and MCP embedder — shared via import. withHfDownloadRetry API is small and injectable (circuit option for test isolation). Env override parsing is simple, safe, and bounded. Comments in embedder catch blocks correctly explain WHY device fallback is skipped for network errors — non-obvious reasoning, well-documented. No unrelated embedding behavior changed. No over-broad retry blanket — non-network errors propagate immediately without retry.


Final verdict

production-ready

All findings from the two prior review cycles have been correctly addressed on the current HEAD (1911b1d): the HF error check runs before writeFatalToStderr with early return (clean single-message UX); HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS env overrides are exposed, validated, floored, and clamped; all 11 override edge cases are tested with fake timers where needed; half-open → failure and half-open → success circuit paths are covered; Windows command syntax appears in all three output locations; pipeline as any cast is fully removed in favor of typed ProgressInfo. The core correctness properties all hold: network errors are classified correctly without misclassifying device failures; pipeline() is wrapped exactly once per attempt in both embedders; device fallback is preserved for non-network errors and correctly short-circuited for network errors; circuit breaker state machine is correct; pino/cliError merge is clean. The one remaining as any (MCP process.stderr.write suppression) is a structurally necessary io-safety cast pre-existing the pipeline fix, not a type-safety regression. The two remaining observations — a discoverability note about HF_DOWNLOAD_TIMEOUT_MS in the error message, and a redundant msg.includes(...) fallback — are cosmetic and do not affect correctness or user safety.

@magyargergo magyargergo merged commit 9d01516 into main May 8, 2026
41 of 43 checks passed
@magyargergo magyargergo deleted the copilot/fix-mac-load-embedding-failed branch May 8, 2026 07:53

Development

Successfully merging this pull request may close these issues.

1.6.4-rc.86 mac load Embedding failed

3 participants