fix(providers): throw ProviderNotConfiguredError instead of raw Error on null provider resolution#30716
Conversation
…ovider resolution Fixes VELLUM-ASSISTANT-BRAIN-EJ — the satellite callers (conversation-store, subagent manager, approval-generators) threw a raw Error when resolveDefaultProvider returned null. This bypassed the existing error classification pipeline in conversation-error.ts that provides actionable user-facing messages through ProviderNotConfiguredError. Also adds an early status check in tryResolveProviderForConnectionName so disabled connections short-circuit before attempting auth resolution. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
✦ APPROVE
Value: 13 production devices hitting BRAIN-EJ (166 events) were getting a generic, unhelpful error banner instead of the actionable "API key required for connection X" message — this restores the classification pipeline for all three satellite callers.
What this does:
- Replaces
new Error(...)withProviderNotConfiguredErrorinconversation-store.ts,subagent/manager.ts, andapproval-generators.ts— the three callers that handle a null return fromresolveDefaultProvider() - Adds an early
connection.status === "disabled"guard intryResolveProviderForConnectionNameso disabled connections short-circuit cleanly before hitting auth resolution
Code review:
ProviderNotConfiguredError constructor signature — verified: (requestedProvider: string, registeredProviders: string[], attribution?: { connectionName?, profileName? }). All three throw sites pass (resolved.provider, listProviders(), { connectionName: resolved.provider_connection }) — matches exactly. ✅
classifyConversationError checks instanceof ProviderNotConfiguredError as its first branch and extracts error.connectionName for the banner. The raw Error throws were silently falling through to the regex fallback, producing useless output. Pipeline now fires correctly. ✅
The disabled early-return in connection-resolution.ts is placed at exactly the right point — after the provider-mismatch hard check, before resolveProviderFromConnection dips into auth. The debug-level log is appropriate (not noise, not silent). Callers will surface ProviderNotConfiguredError with connectionName, which tells BYOK users to configure their own key — the correct action. ✅
Minor observation: For an intentionally disabled managed connection, the user-facing banner will say "API key required for connection X" rather than "connection X is disabled." Semantically it's slightly imprecise, but the prompted action (configure a credential) is correct for BYOK scenarios. Not a blocker — worth a follow-up if user confusion is reported.
This PR is also a perfect application of the documented daemon anti-pattern: "DO NOT throw plain Error from typed error paths — callers use instanceof discrimination; plain Error silently bypasses it." (See anti-patterns KB, cf. PR #6726 web-side equivalent.)
CI: Type Check ✅ · Lint ✅ · FlexFrame Lint ✅ · OpenAPI ✅ · Security ✅ · Tests in-progress (description confirms 523 affected tests pass locally, 3 pre-existing failures on main). Codex 👍'd. Clean. Ready to merge once Test run completes.
Vellum Constitution — Trust-seeking: silent, generic errors replaced with actionable attribution — users now know exactly which connection to fix.
Test ResultsRan adversarial unit tests against the error classification fix. All 4 tests passed. Test Details
Source Verification
CI: 8 passed, 0 failed. Type Check ✅, Lint ✅, Test ✅. |
Closes #30714
Part of JARVIS-765
Prompt / plan
Sentry audit of 0.8.1 revealed BRAIN-EJ (166 events, 13 unique production devices) — satellite callers throwing a generic
ErrorwhenresolveDefaultProvider()returns null, bypassing the existing error classification pipeline.Changes
1. Replace raw
ErrorwithProviderNotConfiguredErrorin three satellite callersconversation-store.ts,subagent/manager.ts, andapproval-generators.tsall threwnew Error("...default provider 'anthropic' is not registered")whenresolveDefaultProvider()returned null. This bypassed the classification pipeline inconversation-error.tswhich specifically checksinstanceof ProviderNotConfiguredErrorand provides actionable user-facing messages ("API key required for connection X"). Users saw a generic unhelpful error instead of guidance.Now they throw
ProviderNotConfiguredErrorwithconnectionNameattribution, so the macOS chat banner renders the correct actionable message.2. Early
connection.status === "disabled"check intryResolveProviderForConnectionNameA disabled connection (e.g. managed connections on BYOK installs, flipped by
disableManagedConnectionsForByokHatch) was still going through auth resolution, which failed with a confusing transient-failure log. Now it short-circuits tonullimmediately with adebug-level log.Root cause analysis
How did the code get into this state? PR refactor(inference): drop legacy getProvider fallback (Phase 1.1 + 1.3) #30217 removed the legacy
getProvider(name)fallback and maderesolveDefaultProvider()returnnullon soft credential failures. The satellite callers were updated to handle null, but they threw rawErrorinstead ofProviderNotConfiguredError. The error type mismatch was not caught because the test mocks supply a working provider.What decisions led to it? The satellite callers were updated in a single refactoring PR that focused on the resolution architecture. The error classification pipeline (which checks
instanceof ProviderNotConfiguredError) was in a different file and the mismatch between the thrown type and the expected type wasn't flagged.Warning signs missed? The
conversation-error.tspipeline explicitly checks forProviderNotConfiguredError— any caller that throws a different error type for the same condition will bypass classification. This coupling is implicit and easy to miss.Prevention: When throwing errors in provider resolution paths, always use
ProviderNotConfiguredError(not rawError) so the existing classification pipeline can provide actionable messages.AGENTS.md update: Not warranted — this is a specific error-type mismatch, not a general pattern. The error-handling docs already cover the convention.
Alternatives NOT taken
Retry with backoff: Initial investigation considered retry, but Sentry data showed the same devices failing repeatedly over days (19 events from one device). The failures are permanent per-device (missing credentials, platform auth unavailable), not transient I/O blips. Retry would add latency before the same failure.
Propagating auth failure reason through the chain: Changing
tryResolveProviderForConnectionName's return type fromProvider | nullto a result object with error codes would give more precise error messages ("Platform auth unavailable" vs "Credential not found") but touches 8+ callers. The existingProviderNotConfiguredError+classifyConversationErrorpipeline already handles the distinction viagetProviderRoutingSource()(managed vs user-key), so the benefit is marginal vs the blast radius.Test plan
bunx tsc --noEmit— passesbun run lint— passesdispatch-connection-routing.test.ts— 4/4 passinference-no-mode-boot-e2e.test.ts— all connection-resolution tests pass (the 1 failure is pre-existing on main)Link to Devin session: https://app.devin.ai/sessions/056fc283f36e4364868105d8eeb64d8c
Requested by: @ashleeradka