fix: actionable HF_ENDPOINT guidance, retries, timeout and circuit breaker when embedding model download fails#1419
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…oad failure When `gitnexus analyze --embeddings` fails because huggingface.co is unreachable (e.g. the GFW, corporate proxies), the error was shown as a raw `TypeError: fetch failed` with no actionable guidance. Changes: - `hf-env.ts`: add and export `isNetworkFetchError()` helper that detects network-level fetch errors (fetch failed, ECONNREFUSED, ENOTFOUND, ETIMEDOUT, ECONNRESET) - `core/embeddings/embedder.ts`: in the device-fallback loop, detect network errors and rethrow immediately with a message telling the user to set HF_ENDPOINT to a mirror (hf-mirror.com) — device fallback is meaningless for network errors that will fail on every device - `mcp/core/embedder.ts`: same fix for the MCP embedder entry point - `cli/analyze.ts`: add a new error branch that detects fetch/network failures and prints a concrete 3-step remediation hint (HF_ENDPOINT, proxy/VPN, offline caching) - `test/unit/hf-env.test.ts`: add 8 unit tests covering all five network error patterns and three negative cases Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/3e314b1c-ca74-44d5-9913-c1418d4e160a Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
…uplication Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/3e314b1c-ca74-44d5-9913-c1418d4e160a Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
|
@copilot rebase on origin/main |
…conflicts Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
|
@copilot we also properly need retries and timeout with ciruit breakes in place. |
Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/6e6f509a-e4d5-4d1f-b0de-b2d30e0a0dce Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
Done in commit
21 new tests cover the circuit state machine, timeout, retry exhaustion, circuit opening mid-run, |
|
Claude finished @magyargergo's task in 5m 3s —— View job PR #1419 Review — Adversarial Production ReadinessTask List
Review bar for this PR
Current PR state
Branch hygiene / merge-conflict assessmentClean. The conflict from the pino/
Understanding of the changeWhy Why HF_ENDPOINT is the right hint: transformers.js reads Why network errors must skip device fallback: The model file cannot be downloaded on any device if the network is blocked. Falling back from CUDA to CPU to WASM wastes time and produces multiple identical failures. Why retry/timeout/circuit breaker helps: Transient jitter (packet loss, flaky CDN edge) benefits from a couple of retries with backoff. A per-attempt timeout prevents a stalled partial download from hanging forever. The circuit breaker prevents an MCP long-running process from repeatedly hammering an unreachable endpoint. Findings[medium] Duplicate guidance output for HF download failures
[medium] Worst-case failure delay ≈ 15 minutes for slow/partial connections
[minor]
|
| Scenario | Covered? |
|---|---|
| All 5 network error patterns | ✅ |
| Non-network negatives (CUDA, empty, module-not-found) | ✅ |
withDownloadTimeout success |
✅ |
withDownloadTimeout timeout rejects ETIMEDOUT |
✅ (real timer, minor flake risk) |
| Timeout propagates non-timeout errors | ✅ |
| Retry success after transient failure | ✅ |
| Retry exhaustion throws last network error | ✅ |
| Non-network error no retry | ✅ |
| Circuit opens after threshold | ✅ |
| Circuit fails fast while open | ✅ |
| Circuit half-open after reset timeout | ✅ |
| Half-open → failure → reopens | ❌ missing |
| Half-open → success closes circuit | ❌ missing (closes immediately but half-open via retry path not tested) |
onRetry callback args |
✅ |
| Circuit reset between tests | ✅ (fresh instances used) |
| Singleton not leaked between tests | ✅ |
isHfCircuitOpenError + isHfDownloadFailure |
✅ |
Fake timers — used correctly for the half-open transition test (try/finally pattern). ✓
Backoff tests — use baseDelayMs: 0 to avoid real delays. Deterministic. ✓
Missing integration tests — no mocked pipeline() tests covering: device fallback preserved for device errors; network error skips device fallback; CLI analyze printing guidance. Desirable follow-ups, not blockers.
Performance / UX assessment
Worst case (slow partial connection): 15 min 6 sec. Unacceptable for a completely blocked endpoint where connection-level errors fire immediately. Acceptable for the common (ENOTFOUND, ECONNREFUSED) case (~6 seconds total). Recommended follow-up: add HF_DOWNLOAD_TIMEOUT_MS env override to let users reduce per-attempt timeout.
Cached/offline behavior: applyHfEnvOverrides sets env.cacheDir. If the model is already cached at that path, Transformers.js loads from cache without network access. No retry penalty for cached models. ✓
Retry noise — onRetry logging only fires in NODE_ENV=development. Production users do not see retry-attempt logs. ✓
Hidden Unicode / hygiene assessment
No bidi/directional control characters (U+202A–U+202E, U+2066–U+2069) found in any of the 5 changed files.
Non-ASCII characters found:
- Em-dashes (
—, U+2014) — in JSDoc comments and string literals (e.g."— will reset in ~${secsUntilReset}s"). Harmless typographic characters in human-readable output. - Emojis (
🧠,🔧,✅,⚠️) inlogger.infocalls — these are pre-existing inembedder.ts, not introduced by this PR. - Arrows (
→,•) in comments and analyze.ts string literals — pre-existing.
GitHub's "hidden Unicode" warning for this PR refers to these benign characters. No executable code logic is affected by Unicode in conditions, identifiers, or control flow. Not a blocker.
No generated files, no package.json changes, no package-lock churn. ✓
Elegance / maintainability assessment
hf-env.ts is cohesive: all HF download resilience helpers in one file. Constants are named and documented with JSDoc. withHfDownloadRetry API is small (fn + optional overrides). The HfRetryOptions.circuit override for test injection is the right pattern.
Retry/circuit logic is not duplicated between core and MCP embedder — both import from hf-env.ts. Error classification is centralized. ✓
Comments in embedder catch blocks correctly explain WHY device fallback is skipped for network errors — this is non-obvious reasoning, well-documented. ✓
No unrelated embedding behavior changed. HTTP backend path unchanged. ✓
Final verdict
production-ready with minor follow-ups
The core correctness properties all hold: network errors are classified correctly and do not misclassify device failures; pipeline() is wrapped exactly once per attempt; device fallback is correctly preserved for ONNX/CUDA errors and correctly skipped for network errors; circuit breaker state transitions are implemented correctly; CLI and MCP embedder are behaviourally aligned; the pino/cliError merge is clean with no conflict artifacts.
The two medium findings — duplicate guidance output (the writeFatalToStderr before the cliError HF block) and the 15-minute worst-case wait for slow partial connections — are real but follow the existing codebase pattern for other error types and only affect slow-connection edge cases respectively. The missing half-open tests and real-timer test are minor hygiene gaps.
Follow-ups recommended before 1.0 or high-traffic release: (1) move HF error handling before writeFatalToStderr to eliminate duplicate output and stack-trace noise, (2) expose HF_DOWNLOAD_TIMEOUT_MS as an env override, (3) add half-open state machine tests, (4) add Windows env-var syntax note.
|
@copilot tackle with all of the findings from above plus make sure |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 8213 tests passed 1 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
|
Claude finished @magyargergo's task in 6m 8s —— View job PR #1419 Review — Adversarial Production ReadinessTask List
Review bar for this PR
Current PR state
Branch hygiene / merge-conflict assessmentClean. Two merge-forward commits (
No Understanding of the change
Findings[medium] No unit tests for
|
| Scenario | Status |
|---|---|
| All 5 network error patterns | ✅ |
| Non-network negatives (CUDA, empty, module-not-found) | ✅ |
withDownloadTimeout success |
✅ |
withDownloadTimeout timeout with fake timers |
✅ (fixed from prior review) |
| Timeout propagates non-timeout errors | ✅ |
| Retry success after transient failure | ✅ |
| Retry exhaustion throws last network error | ✅ |
| Non-network error no retry | ✅ |
| Circuit opens after threshold | ✅ |
| Circuit fails fast while open | ✅ |
| Circuit half-open after reset | ✅ |
| Half-open → failure → reopens | ✅ (fixed from prior review) |
| Half-open → success closes circuit | ✅ (fixed from prior review) |
onRetry callback args |
✅ |
| Circuit state reset between tests | ✅ (fresh instances) |
HF_DOWNLOAD_TIMEOUT_MS env override valid |
❌ missing |
HF_DOWNLOAD_TIMEOUT_MS env override invalid |
❌ missing |
HF_MAX_ATTEMPTS env override valid/invalid |
❌ missing |
Fake timers used correctly via vi.useFakeTimers() + try/finally in all time-sensitive tests. Backoff tests use baseDelayMs: 0. No singleton leakage between tests. ✓
Missing integration tests (desirable follow-ups, not blockers): mocked pipeline() to verify device fallback preserved for device errors; CLI analyze prints one clean cliError block; MCP embedder uses same helper.
Performance / UX assessment
Worst case (stalled partial connection): 3 × 5min + 6s ≈ 15 min 6 sec. Mitigated by HF_DOWNLOAD_TIMEOUT_MS / HF_MAX_ATTEMPTS overrides (documented in hf-env.ts comment, not yet in the error message itself).
Fast-fail cases (ENOTFOUND, ECONNREFUSED): ≈ 6 seconds total — acceptable for all common blocked-network scenarios.
Cached model: applyHfEnvOverrides sets env.cacheDir. If model is already cached, Transformers.js loads from disk with no network access. No retry penalty. ✓
Retry noise: onRetry logging only in NODE_ENV=development. Production users see no retry attempt logs. ✓
Hidden Unicode / hygiene assessment
grep -P "[\x{202A}-\x{202E}\x{2066}-\x{2069}]" across all 5 changed files: zero results. No bidirectional or directional Unicode controls anywhere.
Non-ASCII characters found: em-dashes (—, U+2014) in JSDoc comments and the ETIMEDOUT message string. All are human-readable typographic characters in comments or logged strings — none in conditions, identifiers, or control-flow expressions. GitHub's hidden-Unicode warning refers to these benign characters. Not a blocker.
No generated files, no package.json changes, no package-lock churn. ✓
Elegance / maintainability assessment
hf-env.ts is cohesive: all HF download resilience helpers in one file, exported with @internal JSDoc, well-documented constants, clean type definitions. Retry/circuit logic not duplicated between core and MCP embedder — shared via import. withHfDownloadRetry API is small and injectable (circuit option for test isolation). Comments in embedder catch blocks correctly explain WHY device fallback is skipped — non-obvious reasoning, well-documented. Env override parsing is simple and safe. No unrelated embedding behavior changed. No over-broad retry blanket — non-network errors propagate immediately without retry.
Final verdict
production-ready with minor follow-ups
All prior review findings have been correctly addressed: the HF error check is before writeFatalToStderr (clean single-message UX), HF_DOWNLOAD_TIMEOUT_MS/HF_MAX_ATTEMPTS env overrides are exposed and validated, fake timers are used correctly in the timeout test, half-open → failure and half-open → success circuit tests are present, and Windows command syntax appears in all three output locations. The core correctness properties all hold: network errors are classified correctly without misclassifying device failures, pipeline() is wrapped exactly once per attempt in both embedders, circuit breaker state machine is correct, pino/cliError merge is clean.
The two remaining findings — missing unit tests for the env override parsing code path (medium, DoD gap) and no upper bound clamping on env override values (minor) — do not affect production behavior for any non-explicitly-misconfigured user. The env override logic itself is correct; only test coverage is absent. These are appropriate follow-ups before a high-traffic release but do not block merge.
@copilot Following DoD.md, you must fix these findings! |
Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/198700cc-4191-4ac1-94ec-33d61f2a99f7 Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/198700cc-4191-4ac1-94ec-33d61f2a99f7 Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
Done in commits [medium/DoD] Missing env override tests — Added 11 new tests in a [minor] No upper bound clamping — Added |
|
@copilot I don't like the fact that we have to cast the |
…sInfo; remove unused HF_DOWNLOAD_TIMEOUT_MS import Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/43e60ccd-6f01-4cec-8ee3-c22e8b000efe Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
…gress callback Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/43e60ccd-6f01-4cec-8ee3-c22e8b000efe Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
Done in commits
CodeQL alert (unused import) — |
|
Claude finished @magyargergo's task in 4m 40s —— View job PR #1419 Review — Adversarial Production ReadinessTask List
Review bar for this PR
Current PR state
Branch hygiene / merge-conflict assessmentClean. Two merge-forward commits (
No Understanding of the change
Findings[minor] Worst-case failure delay not mentioned in user-facing error output
[cosmetic] Belt-and-suspenders
|
| Scenario | Status |
|---|---|
| All 5 network error patterns | ✅ |
| Non-network negatives (CUDA, empty, module-not-found) | ✅ |
withDownloadTimeout success |
✅ |
withDownloadTimeout timeout with fake timers |
✅ |
| Timeout propagates non-timeout errors | ✅ |
| Retry success after transient failure | ✅ |
| Retry exhaustion throws last network error | ✅ |
| Non-network error no retry | ✅ |
| Circuit opens after threshold | ✅ |
| Circuit fails fast while open | ✅ |
| Circuit half-open after reset | ✅ |
| Half-open → failure → reopens | ✅ |
| Half-open → success closes circuit | ✅ |
onRetry callback args |
✅ |
| Circuit state reset between tests | ✅ (fresh instances) |
HF_MAX_ATTEMPTS=1/2 → exact N attempts |
✅ |
HF_MAX_ATTEMPTS=abc/0/-1 → default fallback |
✅ |
HF_MAX_ATTEMPTS=9999 → clamped to cap |
✅ |
HF_MAX_ATTEMPTS=2.9 → floored to 2 |
✅ |
HF_DOWNLOAD_TIMEOUT_MS valid → used as timeout |
✅ (fake timers) |
HF_DOWNLOAD_TIMEOUT_MS=-1 → default fallback |
✅ |
HF_DOWNLOAD_TIMEOUT_MS large → clamped |
✅ (fake timers) |
| Explicit options override env vars | ✅ |
Fake timers used correctly via vi.useFakeTimers() + try/finally in all time-sensitive tests. Backoff tests use baseDelayMs: 0. No singleton leakage between tests. ✓
Performance / UX assessment
Worst case (stalled partial connection): 3 × 5min + 6s ≈ 15 min 6 sec. Mitigated by HF_DOWNLOAD_TIMEOUT_MS / HF_MAX_ATTEMPTS overrides (functional, tested, clamped). Discoverability improvement noted as minor follow-up.
Fast-fail (ENOTFOUND, ECONNREFUSED): ≈ 6 seconds total — acceptable for all common blocked-network scenarios.
Cached model: applyHfEnvOverrides sets env.cacheDir. Model already cached → Transformers.js loads from disk with no network access. No retry penalty. ✓
Retry noise: onRetry logging only in NODE_ENV=development. Production users see no retry attempt logs. ✓
Hidden Unicode / hygiene assessment
grep -nP "[\x{202A}-\x{202E}\x{2066}-\x{2069}]" across all 5 changed files: zero results. No bidirectional or directional Unicode controls anywhere.
Non-ASCII characters: em-dashes (—, U+2014) and arrows (→) in JSDoc comments and ETIMEDOUT message string. All human-readable typographic characters in comments/logged strings — none in conditions, identifiers, or control-flow. GitHub's hidden-Unicode warning refers to these benign characters. Not a blocker. ✓
No generated files, no package.json changes, no package-lock churn. ✓
Elegance / maintainability assessment
hf-env.ts is cohesive: all HF download resilience helpers in one file, exported with @internal JSDoc, well-documented constants, clean type definitions. Retry/circuit logic not duplicated between core and MCP embedder — shared via import. withHfDownloadRetry API is small and injectable (circuit option for test isolation). Env override parsing is simple, safe, and bounded. Comments in embedder catch blocks correctly explain WHY device fallback is skipped for network errors — non-obvious reasoning, well-documented. No unrelated embedding behavior changed. No over-broad retry blanket — non-network errors propagate immediately without retry.
Final verdict
production-ready
All findings from the two prior review cycles have been correctly addressed on the current HEAD (1911b1d): the HF error check runs before writeFatalToStderr with early return (clean single-message UX); HF_DOWNLOAD_TIMEOUT_MS and HF_MAX_ATTEMPTS env overrides are exposed, validated, floored, and clamped; all 11 override edge cases are tested with fake timers where needed; half-open → failure and half-open → success circuit paths are covered; Windows command syntax appears in all three output locations; pipeline as any cast is fully removed in favor of typed ProgressInfo. The core correctness properties all hold: network errors are classified correctly without misclassifying device failures; pipeline() is wrapped exactly once per attempt in both embedders; device fallback is preserved for non-network errors and correctly short-circuited for network errors; circuit breaker state machine is correct; pino/cliError merge is clean. The one remaining as any (MCP process.stderr.write suppression) is a structurally necessary io-safety cast pre-existing the pipeline fix, not a type-safety regression. The two remaining observations — a discoverability note about HF_DOWNLOAD_TIMEOUT_MS in the error message, and a redundant msg.includes(...) fallback — are cosmetic and do not affect correctness or user safety.
gitnexus analyze --embeddingssilently crashed withTypeError: fetch failedon networks wherehuggingface.cois unreachable (GFW, corporate proxies). The codebase already bridgedHF_ENDPOINT→env.remoteHostfor mirror support, but no error path surfaced that option, and transient network failures were not retried.Changes
hf-env.ts— exports several new helpers and constants:isNetworkFetchError(msg)— detects network-level fetch failures (fetch failed,ECONNREFUSED,ENOTFOUND,ETIMEDOUT,ECONNRESET)HfDownloadCircuitBreaker— closed/open/half-open state machine; opens after 3 consecutive network failures, resets after 60 s (both configurable). A module-level singletonhfDownloadCircuitis shared by both embedder entry points.withDownloadTimeout(fn, ms)— wraps a download in a hard time-limit (5 minutes default); timeout throwsETIMEDOUTsoisNetworkFetchErrorclassifies it correctlywithHfDownloadRetry(fn, opts?)— wrapspipeline()with per-attempt timeout, exponential-backoff retry on network errors (3 attempts, 2 s → 4 s delay), circuit-breaker recording, and an optionalonRetrycallback for logging. The per-attempt timeout and max attempts are overridable viaHF_DOWNLOAD_TIMEOUT_MSandHF_MAX_ATTEMPTSenv vars; both are validated and clamped (timeoutMs≤ 30 min,maxAttemptsfloored and ≤ 10) to prevent accidental runaway configuration.isHfDownloadFailure(msg)— combinesisNetworkFetchErrorandisHfCircuitOpenErrorfor a single guard covering both raw network errors and circuit-open rejectionsHF_MAX_TIMEOUT_MS/HF_MAX_ATTEMPTS_CAP— exported upper-bound constants used for env-override clampingcore/embeddings/embedder.ts+mcp/core/embedder.ts—pipeline()call is now wrapped inwithHfDownloadRetry(); the device-fallback catch block usesisHfDownloadFailureand rethrows immediately with an actionable message when the failure is network-level (device fallback is meaningless for network errors). The message includes both Unix and Windows-compatible command syntax:cli/analyze.ts— the HF download failure branch is now checked beforewriteFatalToStderr, following the same early-return pattern asRegistryNameCollisionError. This eliminates duplicate output (previously a raw stack trace with an embedded hint was printed, followed by a secondcliError()remediation block). ThecliError()message also includes the Windows-compatible command syntax.test/unit/hf-env.test.ts— 49 tests total (34 new) covering: circuit breaker state machine transitions (including half-open → failure → reopen and half-open → success → closed paths),withDownloadTimeout(usingvi.useFakeTimers()for full determinism), retry exhaustion, circuit opening mid-run,onRetrycallback, non-network error passthrough, all original error pattern / negative cases, and a fullwithHfDownloadRetry env overridessuite validatingHF_MAX_ATTEMPTS(valid values, invalid/zero/negative fallback to default, clamping to 10, floor of fractional values),HF_DOWNLOAD_TIMEOUT_MS(valid, invalid fallback, clamping to 30 min), and explicit options taking precedence over env vars; all time-sensitive assertions usevi.useFakeTimers()