
fix: prevent send on closed channel panic in provider queue shutdown --skip-pipeline #2725

Merged
akshaydeo merged 1 commit into main from
04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload
Apr 17, 2026

Conversation

@Pratham-Mishra04 Pratham-Mishra04 (Collaborator) commented Apr 15, 2026

Summary

Fixes a race condition in provider queue shutdown that caused "send on closed channel" panics in production. The issue occurred when producers passed the isClosing() check but then attempted to send to a queue that was closed before they reached the select statement.

Changes

  • Removed queue channel closure: Queue channels are never closed to prevent "send on closed channel" panics
  • Updated worker exit mechanism: Workers now exit via the done channel signal instead of waiting for queue closure
  • Enhanced shutdown handling: Workers drain remaining buffered requests and send shutdown errors when done is signaled
  • Added producer re-routing: Stale producers can transparently re-route to new queues during UpdateProvider
  • Improved error handling: Added rollback logic for failed provider updates with proper cleanup
  • Enhanced transfer logic: Buffered requests are transferred before signaling shutdown to ensure they reach new workers
  • Added comprehensive tests: Race condition demonstration and validation of the fix across multiple scenarios

Type of change

  • Bug fix
  • Feature
  • Refactor
  • Documentation
  • Chore/CI

Affected areas

  • Core (Go)
  • Transports (HTTP)
  • Providers/Integrations
  • Plugins
  • UI (Next.js)
  • Docs

How to test

Run the new race condition test to verify the fix:

go test -run TestProviderQueue_SendOnClosedChannel_Race ./core -v

Run the comprehensive provider lifecycle tests:

go test -run TestProviderQueue ./core -v
go test -run TestUpdateProvider ./core -v
go test -run TestRemoveProvider ./core -v

Run the full test suite to ensure no regressions:

go test ./...

Screenshots/Recordings

N/A

Breaking changes

  • Yes
  • No

Related issues

Fixes production panics related to concurrent provider queue operations during shutdown/updates.

Security considerations

None - this is an internal concurrency fix that doesn't affect external interfaces or data handling.

Checklist

  • I read docs/contributing/README.md and followed the guidelines
  • I added/updated tests where appropriate
  • I updated documentation where needed
  • I verified builds succeed (Go and UI)
  • I verified the CI pipeline passes locally if applicable

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Collaborator Author

Pratham-Mishra04 commented Apr 15, 2026

@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

ProviderQueue no longer closes pq.queue; shutdown is driven by pq.signalClosing() / pq.done. Workers select on pq.queue and <-pq.done, then drain remaining messages, sending "provider is shutting down" errors. Update/Remove move or reroute buffered messages before signaling shutdown; context-cancel no longer returns pooled messages.

Changes

  • Core shutdown & worker logic (core/bifrost.go): Removed closeQueue() and closeOnce; pq.queue remains open. Worker loop rewritten to select on pq.queue and <-pq.done. Added drainQueueWithErrors. RemoveProvider signals closing, waits, then drains residuals before removing the provider (provider-slice removal uses CAS). UpdateProvider transfers buffered messages to newPq before signaling old-queue shutdown; overflow messages receive "queue full" errors; failures roll back by signaling and draining newPq. Producer paths (tryRequest/tryStreamRequest) attempt to reroute to a non-closing replacement or return "provider is shutting down". On ctx.Done() they no longer call releaseChannelMessage(msg). Shutdown() signals all queues, waits for workers, then final-drains queues with errors.
  • Tests: concurrency & lifecycle (core/bifrost_test.go): Added runtime and sync/atomic imports, test helpers (e.g., newTestChannelMessage), and many concurrency/lifecycle tests: deterministic TOCTOU race reproduction, isClosing()/signalClosing() invariants, worker shutdown without closing the queue, buffered-drain error checks, UpdateProvider transfer/reroute/rollback tests, and RemoveProvider shutdown/TOCTOU stress tests ensuring no panics and correct error propagation.
  • Minor formatting (transports/bifrost-http/handlers/providers.go): Removed two blank lines; no behavioral change.

Sequence Diagram(s)

sequenceDiagram
    participant Producer
    participant TryReq as tryRequest / tryStreamRequest
    participant ReqQueues as requestQueues
    participant OldPQ as ProviderQueue_old
    participant NewPQ as ProviderQueue_new
    participant Worker_old as Worker_old
    participant Done_old as pq.done_old
    participant MsgErr as msg.Err

    Producer->>TryReq: submit request
    TryReq->>ReqQueues: lookup provider queue
    ReqQueues->>OldPQ: return queue
    TryReq->>OldPQ: check isClosing()
    alt replacement exists
        TryReq->>NewPQ: reroute request
        NewPQ->>NewPQ: enqueue request
    else no replacement
        TryReq-->>Producer: "provider is shutting down"
    end

    Note over OldPQ,Done_old: Update/Remove transfer buffered items → then call pq.signalClosing()
    Caller->>OldPQ: pq.signalClosing()
    OldPQ->>Done_old: notify workers
    Worker_old->>OldPQ: drain residual messages
    loop per drained message
        Worker_old->>MsgErr: send "provider is shutting down"
    end
    Worker_old->>Worker_old: wg.Done() and exit

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nudged the queues and hopped in line,

Moved buffered notes before the closing sign.
Workers listened, then gently sent their say,
“Provider is shutting down” — soft on the way.
The burrow hums; no panics chase the day.

🚥 Pre-merge checks: ✅ 3 passed
  • Docstring Coverage: ✅ Passed. Docstring coverage is 93.75%, which meets the required threshold of 80.00%.
  • Description check: ✅ Passed. The PR description covers all required sections from the template with substantial detail on the problem, changes, testing approach, and checklist items.
  • Title check: ✅ Passed. The PR title accurately describes the main change: preventing send-on-closed-channel panics during provider queue shutdown by fundamentally restructuring shutdown logic.



@Pratham-Mishra04 Pratham-Mishra04 changed the title from "fix: prevent send on closed channel panic in provider queue shutdown" to "[DO NOT MERGE] fix: prevent send on closed channel panic in provider queue shutdown" on Apr 15, 2026
@github-actions
Contributor

🧪 Test Suite Available

This PR can be tested by a repository admin.

Run tests for PR #2725

@greptile-apps
Contributor

greptile-apps Bot commented Apr 15, 2026

Confidence Score: 3/5

Not ready to merge — prior review comments flagged P1 reliability concerns (N-worker concurrent drain race, residual post-drain enqueue window) that remain open, and the PR is explicitly marked DO NOT MERGE.

The core channel-safety fix is sound and several prior concerns were addressed (double-release removed, time.After leak in drain loop removed, ctx.Done() added to stream response-wait). However, the concurrent N-worker drain race and the acknowledged post-drain TOCTOU window that can strand requests until context expiry are still open P1 concerns from earlier review rounds. New findings in this pass are P2 only (dead !ok test branches, inaccurate doc comment).

core/bifrost.go — the drain-loop concurrency and post-drain window; core/bifrost_test.go — dead !ok branches in two test goroutines

Important Files Changed

  • core/bifrost.go: Core fix. Removes closeQueue() and switches workers to exit via pq.done instead of channel close. Adds drainQueueWithErrors(), re-routing logic in tryRequest/tryStreamRequest, ctx.Done() handling in the tryStreamRequest response-wait, and rollback paths for UpdateProvider errors. A doc comment overstates the write-lock guarantee for stale producers.
  • core/bifrost_test.go: Adds a comprehensive test suite covering the TOCTOU race, drain behaviour, worker lifecycle, and concurrent UpdateProvider/RemoveProvider scenarios. Two test goroutines retain a dead !ok check on pq.queue that contradicts the fix's never-close invariant.
  • transports/bifrost-http/handlers/providers.go: Trivial blank-line removals after function calls; no logic changes.

Reviews (12): Last reviewed commit: "fix: send to closed channels edge case r..." | Re-trigger Greptile

Comment thread core/bifrost.go Outdated
Comment thread core/bifrost.go
Comment thread core/bifrost_test.go
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/bifrost_test.go`:
- Around line 1304-1394: Test reproduces the old bug by calling pq.closeQueue()
directly; change TestProviderQueue_SendOnClosedChannel_Race to drive shutdown
via the real public flow (e.g. call RemoveProvider/UpdateProvider/Shutdown that
will invoke ProviderQueue.signalClosing()/closeQueue() internally) instead of
calling ProviderQueue.closeQueue() directly, and exercise the producer path
through tryRequest (or the code that sends to pq.queue with the pq.done guard)
so the test asserts there is no panic, no hang, and that the shutdown call
returns an error to the caller when appropriate; update assertions to fail on
any panic or hang and to validate the public API shutdown behavior rather than
expecting a panic from direct pq.closeQueue() use.

In `@core/bifrost.go`:
- Around line 4937-4971: After signalClosing(), producers that passed
isClosing() can still send into pq.queue and get stuck if workers exit on a
default empty-queue return; fix by adding a sender quiesce/refcount so workers
don't stop servicing the old queue until all pre-close senders complete.
Concretely: in tryRequest and tryStreamRequest increment a sender counter
(atomic or sync.WaitGroup) before attempting pq.queue <- msg and decrement after
the send (or on error/timeout), have signalClosing() wait for that counter to
reach zero (or call Done on the WaitGroup) before closing/marking pq.done, and
ensure queue closure is performed via sync.Once to avoid send-on-closed-channel
races; keep pq.queue reads draining until the refcount is zero and pq.done is
closed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 268036c0-f15e-4768-9053-51df70324f75

📥 Commits

Reviewing files that changed from the base of the PR and between 67938e2 and 7b417a6.

📒 Files selected for processing (2)
  • core/bifrost.go
  • core/bifrost_test.go

Comment thread core/bifrost_test.go
Comment thread core/bifrost.go
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 7b417a6 to 592c6e1 Compare April 15, 2026 20:05
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
core/bifrost.go (1)

4955-4981: ⚠️ Potential issue | 🔴 Critical

Workers can still exit while stale producers enqueue into the old queue.

After done is closed, stale producers can still win pq.queue <- msg over <-pq.done, and this drain loop can return on a transient empty/default before those sends occur. That leaves messages in an open queue with no workers, causing caller hangs until context timeout.

Suggested fix (sender quiesce barrier)
 type ProviderQueue struct {
 	queue      chan *ChannelMessage
 	done       chan struct{}
 	closing    uint32
+	activeSenders atomic.Int64
 	signalOnce sync.Once
 	closeOnce  sync.Once
 }
+
+func (pq *ProviderQueue) beginSend() bool {
+	if pq.isClosing() {
+		return false
+	}
+	pq.activeSenders.Add(1)
+	if pq.isClosing() {
+		pq.activeSenders.Add(-1)
+		return false
+	}
+	return true
+}
+
+func (pq *ProviderQueue) endSend() {
+	pq.activeSenders.Add(-1)
+}
- select {
+ if !pq.beginSend() {
+   bifrost.releaseChannelMessage(msg)
+   return nil, newBifrostErrorFromMsg("provider is shutting down")
+ }
+ defer pq.endSend()
+ select {
   case pq.queue <- msg:
   case <-pq.done:
-               default:
-                   return
+               default:
+                   if pq.activeSenders.Load() == 0 {
+                       return
+                   }
+                   time.Sleep(1 * time.Millisecond)

Verify with a stress run:

# Stress check for orphaned request hangs under shutdown/update races
go test -run TestProviderQueue_SendOnClosedChannel_Race ./core -count=200 -race -timeout=60s

As per coding guidelines, "Use ProviderQueue pattern with atomic flags and sync.Once to prevent 'send on closed channel' panics when managing provider request queues."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 4955 - 4981, The drain loop can return while
producers still can succeed in sending to pq.queue after pq.done is closed,
leaving messages orphaned; fix by implementing a quiesce/close protocol on
ProviderQueue: add an atomic boolean (e.g. pq.closed) and a sync.Once close
operation that marks pq.closed and prevents further sends, make the enqueue path
check pq.closed (return an error rather than blocking) before attempting
pq.queue <- msg, and have shutdown use the sync.Once to close and then wait for
workers to fully drain the queue (or use a WaitGroup) before returning; update
the drain loop (the section reading from pq.queue and using
releaseChannelMessage) to rely on the closed flag + empty queue semantics so no
messages can be enqueued after shutdown begins.
core/bifrost_test.go (1)

1306-1396: ⚠️ Potential issue | 🟠 Major

This test still validates the old panic path, not the fixed shutdown contract.

Line 1381 manually calls pq.closeQueue(), and Lines 1392-1394 require panic occurrence for success. That makes CI success depend on a deprecated unsafe path rather than RemoveProvider/UpdateProvider/Shutdown behavior after the fix. Please remove or rewrite this test to drive the public shutdown/update flow and assert no panic, no hang, and shutdown error propagation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost_test.go` around lines 1306 - 1396, The test
TestProviderQueue_SendOnClosedChannel_Race still forces the old unsafe path by
calling pq.closeQueue() directly and expecting a panic; instead rewrite it to
exercise the public shutdown/update API (e.g. call the real
RemoveProvider/UpdateProvider or Shutdown code paths that replace
pq.signalClosing()/pq.closeQueue()) and assert the new contract: no "send on
closed channel" panic, no hangs, and proper error propagation from the public
shutdown/update calls. Replace direct calls to
pq.closeQueue()/pq.signalClosing() with the higher-level methods that operate on
ProviderQueue (the code paths that tryRequest uses: isClosing(), tryRequest, and
the public RemoveProvider/UpdateProvider/Shutdown flows), run many iterations to
exercise races but fail the test if any panic/hang occurs, and add assertions
that the public call returns the expected shutdown/update error where
applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/bifrost.go`:
- Around line 4386-4388: The code currently calls bifrost.getProviderQueue in
the closing-path reroute which lazily creates a provider queue and can resurrect
a provider during shutdown; change the lookup to a non-creating accessor (e.g. a
getProviderQueueNoCreate / lookupProviderQueue / providerQueueIfExists method)
so it only returns an existing queue or an error without creating one, update
the branch that assigns pq to use that non-creating lookup instead of
bifrost.getProviderQueue, and make the same replacement at the other occurrence
mentioned (around lines 4638-4640) and in RemoveProvider-related cleanup paths
to avoid recreating providers during shutdown.

---

Duplicate comments:
In `@core/bifrost_test.go`:
- Around line 1306-1396: The test TestProviderQueue_SendOnClosedChannel_Race
still forces the old unsafe path by calling pq.closeQueue() directly and
expecting a panic; instead rewrite it to exercise the public shutdown/update API
(e.g. call the real RemoveProvider/UpdateProvider or Shutdown code paths that
replace pq.signalClosing()/pq.closeQueue()) and assert the new contract: no
"send on closed channel" panic, no hangs, and proper error propagation from the
public shutdown/update calls. Replace direct calls to
pq.closeQueue()/pq.signalClosing() with the higher-level methods that operate on
ProviderQueue (the code paths that tryRequest uses: isClosing(), tryRequest, and
the public RemoveProvider/UpdateProvider/Shutdown flows), run many iterations to
exercise races but fail the test if any panic/hang occurs, and add assertions
that the public call returns the expected shutdown/update error where
applicable.

In `@core/bifrost.go`:
- Around line 4955-4981: The drain loop can return while producers still can
succeed in sending to pq.queue after pq.done is closed, leaving messages
orphaned; fix by implementing a quiesce/close protocol on ProviderQueue: add an
atomic boolean (e.g. pq.closed) and a sync.Once close operation that marks
pq.closed and prevents further sends, make the enqueue path check pq.closed
(return an error rather than blocking) before attempting pq.queue <- msg, and
have shutdown use the sync.Once to close and then wait for workers to fully
drain the queue (or use a WaitGroup) before returning; update the drain loop
(the section reading from pq.queue and using releaseChannelMessage) to rely on
the closed flag + empty queue semantics so no messages can be enqueued after
shutdown begins.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36a8b354-7f23-48c7-bc89-0cf443399215

📥 Commits

Reviewing files that changed from the base of the PR and between 7b417a6 and 592c6e1.

📒 Files selected for processing (2)
  • core/bifrost.go
  • core/bifrost_test.go

Comment thread core/bifrost.go Outdated
Comment thread core/bifrost.go
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 592c6e1 to 4b9cdf2 Compare April 16, 2026 07:40
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 4b9cdf2 to 1c91a1f Compare April 16, 2026 07:55
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
Comment thread core/bifrost.go Outdated
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 1c91a1f to be15b50 Compare April 16, 2026 08:11
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from be15b50 to 7993b5e Compare April 16, 2026 08:52
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 7993b5e to 8cdd3a1 Compare April 16, 2026 09:08
Comment thread core/bifrost.go
Comment thread core/bifrost.go
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 8cdd3a1 to 225802b Compare April 16, 2026 09:55
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
core/bifrost.go (1)

6069-6108: ⚠️ Potential issue | 🔴 Critical

The one-shot final drain still leaves hung requests possible.

drainQueueWithErrors is still a non-blocking sweep after the workers exit. A producer that already holds pq can race past the final default: return and then win pq.queue <- msg, landing a message in an open queue with no workers and no further drain pass, so the caller blocks on msg.Response / msg.Err until its context is cancelled. This narrows the old panic window, but it doesn't actually close the shutdown race. As per coding guidelines, "Use ProviderQueue pattern with atomic flags and sync.Once to prevent 'send on closed channel' panics when managing provider request queues."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 6069 - 6108, drainQueueWithErrors currently
does a one-shot non-blocking sweep and can miss messages sent after it returns;
fix by adding a ProviderQueue-level closed flag (atomic.Bool) plus a sync.Once
close method and make producers check that flag before attempting pq.queue <-
msg (e.g. in tryRequest or the send path) so no new sends are allowed after
close is initiated; call the closeOnce/atomic store at the start of
drainQueueWithErrors (or in the same shutdown path that waits for workers) to
atomically prevent further sends to pq.queue, then keep the existing draining
logic to flush existing messages and return. Use the existing symbols:
ProviderQueue, pq.queue, drainQueueWithErrors, and the producer send site
(tryRequest) to implement the atomic closed flag and sync.Once close.
🧹 Nitpick comments (1)
core/bifrost_test.go (1)

1976-1998: Ordering differs from TestUpdateProvider_TransferOrdering — clarify step comments.

The updater loop here performs:

  1. signalClosing(oldPq) (line 1977)
  2. Drain oldPq → newPq (lines 1982-1995)
  3. Store newPq (line 1998)

But TestUpdateProvider_TransferOrdering (lines 1814-1838) expects the inverse: transfer before signalClosing().

For this test's goal (zero panics), either ordering is safe since the queue channel is never closed. However, the comment "performs UpdateProvider-style queue replacements" is misleading if the production UpdateProvider actually transfers before signaling.

Consider updating the step comments to clarify this is an alternative ordering for stress-testing purposes, or align with the expected production ordering:

-// Step 2: signal old queue closing so no new producers land on it.
-oldPq.signalClosing()
-
-// Step 3: drain oldPq into newPq before exposing newPq...
+// Note: For this stress test, we signal closing first so producers
+// see isClosing() == true and re-route. Production UpdateProvider
+// may use different ordering (transfer-then-signal).
+oldPq.signalClosing()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost_test.go` around lines 1976 - 1998, The test's updater loop
currently calls oldPq.signalClosing() before draining oldPq into newPq and then
does requestQueues.Store(provider, newPq), which conflicts with the ordering
assumed by TestUpdateProvider_TransferOrdering (which transfers before
signaling); either update the inline step comments to explicitly state this test
intentionally uses the alternative ordering for stress-testing (mentioning
oldPq.signalClosing(), the drain loop, and requestQueues.Store(provider,
newPq)), or change the sequence to match production by moving the
signalClosing() call to after the drain so the flow matches
TestUpdateProvider_TransferOrdering; keep the comment clear about which ordering
is being tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/bifrost.go`:
- Around line 3238-3240: Currently requestQueues.Store(providerKey, newPq)
publishes newPq before any workers are started, causing the transfer goroutine
to block if there are more messages than newPq can accept; fix by keeping newPq
local (do not call requestQueues.Store) until after createBaseProvider and the
new provider/workers are fully initialized and transferWaitGroup has been
satisfied, start the transfer goroutine only after workers are consuming
newPq.queue, and if any initialization or the provider-slice swap
(createBaseProvider, provider slice update) fails roll back or avoid storing the
entry (or remove it) so callers cannot enqueue into a visible queue with zero
workers; apply the same change to the other affected blocks around the transfer
goroutine sections noted (including the regions around lines 3252-3294 and
3317-3324).
- Around line 4772-4774: The code currently returns pooled ChannelMessage to the
pool on ctx.Done() by calling bifrost.releaseChannelMessage(msg), which can let
a late write to req.ResponseStream or req.Err leak stale data into the next
user; remove the call to bifrost.releaseChannelMessage(msg) from the case
<-ctx.Done() branch so the worker retains ownership of the ChannelMessage until
it finishes handling the request and explicitly calls
bifrost.releaseChannelMessage(msg) after it sends to req.ResponseStream or
req.Err; keep the newBifrostCtxDoneError return but ensure any code paths that
complete the worker (the sender to req.ResponseStream/req.Err) perform the
pool.Put/reset via releaseChannelMessage to avoid stale fields.

---

Duplicate comments:
In `@core/bifrost.go`:
- Around line 6069-6108: drainQueueWithErrors currently does a one-shot
non-blocking sweep and can miss messages sent after it returns; fix by adding a
ProviderQueue-level closed flag (atomic.Bool) plus a sync.Once close method and
make producers check that flag before attempting pq.queue <- msg (e.g. in
tryRequest or the send path) so no new sends are allowed after close is
initiated; call the closeOnce/atomic store at the start of drainQueueWithErrors
(or in the same shutdown path that waits for workers) to atomically prevent
further sends to pq.queue, then keep the existing draining logic to flush
existing messages and return. Use the existing symbols: ProviderQueue, pq.queue,
drainQueueWithErrors, and the producer send site (tryRequest) to implement the
atomic closed flag and sync.Once close.

---

Nitpick comments:
In `@core/bifrost_test.go`:
- Around line 1976-1998: The test's updater loop currently calls
oldPq.signalClosing() before draining oldPq into newPq and then does
requestQueues.Store(provider, newPq), which conflicts with the ordering assumed
by TestUpdateProvider_TransferOrdering (which transfers before signaling);
either update the inline step comments to explicitly state this test
intentionally uses the alternative ordering for stress-testing (mentioning
oldPq.signalClosing(), the drain loop, and requestQueues.Store(provider,
newPq)), or change the sequence to match production by moving the
signalClosing() call to after the drain so the flow matches
TestUpdateProvider_TransferOrdering; keep the comment clear about which ordering
is being tested.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e081032f-40d5-4ef9-92cd-58be67d2a9bd

📥 Commits

Reviewing files that changed from the base of the PR and between 7993b5e and 225802b.

📒 Files selected for processing (2)
  • core/bifrost.go
  • core/bifrost_test.go

Comment thread core/bifrost.go
Comment thread core/bifrost.go
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 225802b to d0f3b1c Compare April 16, 2026 10:34
akshaydeo
akshaydeo previously approved these changes Apr 16, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from d0f3b1c to 24896d4 Compare April 16, 2026 11:34
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
core/bifrost.go (1)

3145-3146: ⚠️ Potential issue | 🟠 Major

The final sweep still leaves a stranded-request window.

The helper comment at Line 6100 describes the remaining race: a producer that passed the closing gate before signalClosing() can still win pq.queue <- msg after this one-shot drain exits via default. At that point RemoveProvider, UpdateProvider, and Shutdown have no worker and no second sweep for that queue, so the caller waits on msg.Response/msg.Err until its own context is cancelled.

This fixes the panic, but shutdown/update still isn't deterministic under load. You'll need a sender quiesce/barrier or an equivalent always-on drainer before treating the queue as fully drained.

Also applies to: 3321-3322, 6100-6131, 6656-6660

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 3145 - 3146, The final one-shot drain in
drainQueueWithErrors leaves a race where a producer that passed signalClosing()
can still send to pq.queue (pq.queue <- msg) after the default-exit drain,
leaving callers blocked on msg.Response/msg.Err; fix by introducing a sender
quiesce/barrier or an always-on drainer: add a quiesce phase or a
"no-new-senders" signal that producers must observe (or a producer WaitGroup)
before treating the queue drained, and replace the one-shot default-exit with a
persistent goroutine that continues draining pq.queue until a proper
closed+empty condition is reached; update signalClosing(), drainQueueWithErrors,
and the shutdown paths for RemoveProvider/UpdateProvider/Shutdown to coordinate
with the quiesce (or persistent drainer) so no msg can be sent after the closing
gate without a worker to complete msg.Response/msg.Err.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/bifrost.go`:
- Around line 3238-3240: The new per-provider queue (newPq) is stored into
bifrost.requestQueues (bifrost.requestQueues.Store(providerKey, newPq)) before
the CAS replacement loop can fail, leaving a workerless queue if
maxReplacementAttempts is reached; modify the failure path in the CAS loop (the
branch that gives up after maxReplacementAttempts) to undo the store by
removing/clearing the entry for providerKey (i.e., rollback newPq) so callers
cannot enqueue into a queue with no workers, and apply the same rollback logic
to the analogous failure path around the second occurrence (the block referenced
at lines ~3345-3351) to ensure both replacement code paths remove newPq when
they give up.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 21f8ac1d-5408-4c3a-8c3f-8274b9e756e9

📥 Commits

Reviewing files that changed from the base of the PR and between d0f3b1c and 24896d4.

📒 Files selected for processing (2)
  • core/bifrost.go
  • core/bifrost_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/bifrost_test.go

Comment thread core/bifrost.go
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 24896d4 to 5018b50 Compare April 16, 2026 11:45
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
core/bifrost_test.go (1)

1333-1337: Consider extracting a shared newTestProviderQueue helper to reduce drift.

ProviderQueue test setup is repeated many times with the same fields. Centralizing it will make future queue-lifecycle changes less error-prone.

♻️ Suggested refactor
+func newTestProviderQueue(bufferSize int) *ProviderQueue {
+	return &ProviderQueue{
+		queue:      make(chan *ChannelMessage, bufferSize),
+		done:       make(chan struct{}),
+		signalOnce: sync.Once{},
+	}
+}
- pq := &ProviderQueue{
- 	queue:      make(chan *ChannelMessage, 10),
- 	done:       make(chan struct{}),
- 	signalOnce: sync.Once{},
- }
+ pq := newTestProviderQueue(10)

Also applies to: 1429-1433, 1476-1480, 1502-1506, 1555-1559, 1636-1640, 1721-1730, 1793-1802, 1875-1879, 2035-2039, 2095-2099, 2166-2170, 2226-2230

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost_test.go` around lines 1333 - 1337, Several tests repeatedly
instantiate ProviderQueue with identical fields; extract a helper function named
newTestProviderQueue that returns *ProviderQueue configured with queue:
make(chan *ChannelMessage, 10), done: make(chan struct{}), and signalOnce:
sync.Once{} to centralize setup and reduce duplication; replace inline struct
literals in tests (the places creating &ProviderQueue{queue: make(chan
*ChannelMessage, 10), done: make(chan struct{}), signalOnce: sync.Once{}}) with
calls to newTestProviderQueue() so future lifecycle changes are made in one
location.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3e5e3419-e0a0-4658-a355-73bd91a132ae

📥 Commits

Reviewing files that changed from the base of the PR and between 24896d4 and 5018b50.

📒 Files selected for processing (2)
  • core/bifrost.go
  • core/bifrost_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/bifrost.go

coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 16, 2026
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 5018b50 to 99c287f Compare April 16, 2026 12:19
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/bifrost.go (1)

6135-6158: ⚠️ Potential issue | 🟠 Major

Reset the full pooled ChannelMessage before Put().

releaseChannelMessage() returns the object to channelMessagePool, but it keeps Context and the embedded BifrostRequest populated. That retains old request payloads/contexts in the pool and violates the repo’s pool-reset rule.

♻️ Suggested fix
 func (bifrost *Bifrost) releaseChannelMessage(msg *ChannelMessage) {
 	// Put channels back in pools
 	bifrost.responseChannelPool.Put(msg.Response)
 	bifrost.errorChannelPool.Put(msg.Err)
@@
 	// Clear references and return to pool
+	msg.BifrostRequest = schemas.BifrostRequest{}
+	msg.Context = nil
 	msg.Response = nil
 	msg.ResponseStream = nil
 	msg.Err = nil
 	bifrost.channelMessagePool.Put(msg)
 }

As per coding guidelines, "Always reset ALL fields of pooled objects before calling pool.Put() to prevent stale data from leaking to subsequent users."
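The guideline can be satisfied with a one-statement struct reset before `Put()`. A minimal sketch with hypothetical field names (the real `ChannelMessage` carries different fields):

```go
package main

import (
	"fmt"
	"sync"
)

// Message stands in for ChannelMessage; the field names here are
// hypothetical, but the reset rule is the same.
type Message struct {
	Payload string
	Ctx     map[string]string
}

var pool = sync.Pool{New: func() any { return new(Message) }}

// release zeroes every field before Put, so a later Get can never observe
// a previous request's payload or context.
func release(m *Message) {
	*m = Message{} // one statement resets the whole struct
	pool.Put(m)
}

func main() {
	m := pool.Get().(*Message)
	m.Payload = "secret request body"
	m.Ctx = map[string]string{"trace": "abc"}
	release(m)

	reused := pool.Get().(*Message)
	fmt.Printf("payload=%q ctx=%v\n", reused.Payload, reused.Ctx) // payload="" ctx=map[]
}
```

Zeroing the whole struct (`*m = Message{}`) is harder to get wrong than resetting fields one by one, since a later field addition is covered automatically.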

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 6135 - 6158, The releaseChannelMessage function
returns a ChannelMessage to channelMessagePool but currently leaves msg.Context
and the embedded BifrostRequest (and any other request-related fields)
populated, which can leak stale data; before calling
bifrost.channelMessagePool.Put(msg) clear all remaining fields on msg — at
minimum set msg.Context = nil and zero out the embedded BifrostRequest (and any
other request/payload fields) after draining/returning channels and before Put —
so the pooled ChannelMessage contains no prior request state when reused.
♻️ Duplicate comments (1)
core/bifrost.go (1)

6094-6133: ⚠️ Potential issue | 🟠 Major

This still leaves a post-drain enqueue window.

The comment here is accurate, but it also describes a real unresolved bug: after the worker drain loop and this final sweep both exit on default, a producer that already passed the closing gate can still have select choose pq.queue <- msg over <-pq.done. That request has no worker and no later sweep, so it hangs until its own context is cancelled.

This is the same stale-producer race previously called out; the single extra sweep narrows it but does not close it. A sender quiesce/barrier is still needed before the last drain returns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 6094 - 6133, The final non-blocking sweep in
drainQueueWithErrors still leaves a race where a producer that passed the
closing gate can enqueue after the sweep and never get drained; fix by adding a
sender-side quiesce: add a sender counter/WG to ProviderQueue (e.g., senderWg or
activeSenders int32 + sync.WaitGroup) and update tryRequest to increment the
sender ref before checking/attempting the send and decrement after the send or
on context cancel; then change drainQueueWithErrors to wait for that senderWg to
reach zero before doing the final non-blocking sweep so no sender can enqueue
after the sweep exits. Ensure updates touch ProviderQueue struct, tryRequest
(the producer path), and drainQueueWithErrors to use the new sender quiesce
mechanism.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/bifrost.go`:
- Around line 3325-3334: The rollback path currently removes requestQueues and
waitGroups but leaves the stale provider in bifrost.providers, causing
GetConfiguredProviders() and getProviderByKey() to expose a dead provider and
allowing prepareProvider() to append a duplicate; update the failure branches
that call newPq.signalClosing(), bifrost.requestQueues.Delete(providerKey),
bifrost.waitGroups.Delete(providerKey), and bifrost.drainQueueWithErrors(newPq)
to also remove the provider entry from bifrost.providers (and any other
provider-index maps) for the same providerKey so the provider is fully removed
on error; apply the same change to both failure locations (the block around
drainQueueWithErrors and the similar block at 3346-3353) to ensure retries via
prepareProvider() start from a clean state.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab83018b-f961-44e3-8992-e5bd265ba4c9

📥 Commits

Reviewing files that changed from the base of the PR and between 5018b50 and 99c287f.

📒 Files selected for processing (2)
  • core/bifrost.go
  • core/bifrost_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/bifrost_test.go

Comment thread core/bifrost.go
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch from 99c287f to 8e2fab7 Compare April 16, 2026 13:05
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
core/bifrost.go (1)

6136-6142: ⚠️ Potential issue | 🟠 Major

Residual post-sweep enqueue can still strand requests.

This one-shot non-blocking sweep still allows a producer to enqueue after the final default exit, leaving the request without workers and waiting until caller cancellation. That’s a reliability gap during RemoveProvider/UpdateProvider/Shutdown when contexts are long-lived.

Please add sender quiescing (refcount/barrier) before final drain, or keep draining until producer activity is fully quiesced.

As per coding guidelines, "Use ProviderQueue pattern with atomic flags and sync.Once to prevent 'send on closed channel' panics when managing provider request queues."
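The guideline's pattern, as reflected in this PR's test fixtures (`queue`, `done`, `signalOnce`), can be sketched as follows; the `closing` atomic flag and method names are assumptions, not the exact code in `core/bifrost.go`:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ProviderQueue sketch: the queue channel is never closed; shutdown is
// signaled once via done, guarded by sync.Once plus an atomic flag.
type ProviderQueue struct {
	queue      chan int
	done       chan struct{}
	closing    atomic.Bool
	signalOnce sync.Once
}

func (pq *ProviderQueue) isClosing() bool { return pq.closing.Load() }

// signalClosing is idempotent: concurrent RemoveProvider/UpdateProvider/
// Shutdown paths can all call it without a double-close panic.
func (pq *ProviderQueue) signalClosing() {
	pq.signalOnce.Do(func() {
		pq.closing.Store(true)
		close(pq.done) // done closes exactly once; queue is never closed
	})
}

func main() {
	pq := &ProviderQueue{queue: make(chan int, 4), done: make(chan struct{})}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // racing shutdown paths: still exactly one close
		wg.Add(1)
		go func() { defer wg.Done(); pq.signalClosing() }()
	}
	wg.Wait()
	<-pq.done // receiving from a closed done never blocks
	fmt.Println(pq.isClosing()) // true
}
```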

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 6136 - 6142, The one-shot non-blocking sweep
can leave producers able to enqueue after the final `select { default: }` exit,
stranding requests; update the drain logic in the provider queue code (the sweep
invoked during `RemoveProvider`/`UpdateProvider`/`Shutdown`) to wait for sender
quiescence instead of a single non-blocking pass: add a sender-side reference
count/barrier (e.g., increment in send path and decrement on send completion)
and have the drain loop wait until that refcount reaches zero or a short
timeout, and protect queue closure with a `ProviderQueue` pattern using an
atomic closed flag and `sync.Once` to ensure no sends occur to a closed channel
and avoid race panics. Ensure you reference and modify the sweep/drain routine,
the send path that currently does the non-blocking `select { default: }`, and
the provider lifecycle callers (`RemoveProvider`, `UpdateProvider`, `Shutdown`)
so they trigger quiescing before the final drain.
🧹 Nitpick comments (2)
core/bifrost_test.go (1)

2223-2299: Rename this test or assert the drain outcome.

The body currently treats both successfulSends and shutdownErrors as valid, so it proves “no panic in the TOCTOU window” rather than “new producers get shutdown errors.” Renaming it to reflect that narrower guarantee, or draining/asserting the successfully enqueued messages, would make the test intent match its behavior.
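The suggested drain-and-assert step could use a small non-blocking helper (hypothetical; the real test queue carries `*ChannelMessage`, not `int`):

```go
package main

import "fmt"

// countDrained empties a buffered channel without blocking and reports how
// many messages were still queued, so the test can compare the count
// against successfulSends.
func countDrained(ch chan int) int {
	n := 0
	for {
		select {
		case <-ch:
			n++
		default:
			return n
		}
	}
}

func main() {
	pqQueue := make(chan int, 10)
	pqQueue <- 1
	pqQueue <- 2
	pqQueue <- 3
	fmt.Println(countDrained(pqQueue)) // 3
}
```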

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost_test.go` around lines 2223 - 2299, The test
TestRemoveProvider_ConcurrentNewProducersDuringShutdown currently allows both
successfulSends and shutdownErrors, so update it to assert the drain outcome:
after producerWg.Wait() drain pq.queue non-blockingly (read until channel empty)
counting drainedMessages, then assert atomic.LoadInt64(&successfulSends) ==
int64(drainedMessages) and optionally assert drainedMessages == 0 if the intent
is "new producers get shutdown errors"; reference
TestRemoveProvider_ConcurrentNewProducersDuringShutdown, pq.queue,
successfulSends and shutdownErrors when making the change.
core/bifrost.go (1)

3186-3192: Concurrency guarantee comment is currently stronger than the implementation.

Line 3187/3188 says all producers go through getProviderQueue read-locking, but stale-producer reroute paths use direct requestQueues.Load (Line 4465, Line 4732). Please reword this comment to avoid a misleading guarantee.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/bifrost.go` around lines 3186 - 3192, The comment on UpdateProvider
overstates the concurrency guarantee; update it to say that while producer paths
that call getProviderQueue do acquire the providerMutex read lock before looking
up/enqueueing (so they cannot see newPq until the write lock is released and
workers are started), there are other paths (notably the stale-producer reroute
code which performs direct requestQueues.Load) that bypass getProviderQueue and
its read-lock, so UpdateProvider should not claim an absolute prohibition of any
producer observing newPq before workers run; mention getProviderQueue and the
direct requestQueues.Load/stale-producer reroute paths by name to clarify the
exception.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1ae4c33-0ca0-4499-b119-35c10d9b3bae

📥 Commits

Reviewing files that changed from the base of the PR and between 99c287f and 8e2fab7.

📒 Files selected for processing (3)
  • core/bifrost.go
  • core/bifrost_test.go
  • transports/bifrost-http/handlers/providers.go
💤 Files with no reviewable changes (1)
  • transports/bifrost-http/handlers/providers.go

Collaborator Author

@greptile

Re: N-worker concurrent drain race

Go channels guarantee that each message is received by exactly one goroutine — this is a language-level invariant, not a convention. When all N workers select on pq.done, whichever worker wins the pq.done case is responsible for draining; the others move on. Even if multiple workers simultaneously enter the drain loop, each <-pq.queue dequeue is atomic and exclusive — there is no double-receive. No shared mutable state is touched during the drain read itself, so concurrent drainers are safe.

Re: post-drain TOCTOU window

This is a known, documented limitation. The window exists between a producer passing the isClosing() check and depositing its message, after the drain sweep has already exited. The message will sit unread until the caller's context deadline fires, at which point the ctx.Done() case in the response-wait select returns an error to the caller. The window is therefore bounded by the request's own deadline — no request is stranded indefinitely. Eliminating it entirely would require sender-side reference counting on the queue (increment before the isClosing() check, decrement after enqueue, drain waits for zero), which adds overhead on every single enqueue in the hot path. We've judged the bounded-by-deadline behavior acceptable for now and left the tradeoff documented in the drainQueueWithErrors comment.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

Both points are well-reasoned. My original concerns are addressed:

On concurrent drain: You're correct — channel receives in Go are atomic and mutually exclusive by the memory model. Multiple goroutines racing on <-pq.queue will each get distinct messages, never the same one. The "concurrent drainers" concern I raised was unfounded. Withdrawing that.

On post-drain TOCTOU: The bounded-by-deadline framing is the right mental model here. A message that slips through after the drain sweep isn't lost forever — it waits at most until ctx.Done() fires on the producer side. The tradeoff you've described (sender-side WaitGroup / refcount on the hot enqueue path) is real overhead that isn't justified when the deadline already provides the bound. The comment in drainQueueWithErrors documenting this tradeoff is the right call.

Both points accepted. LGTM on the concurrency reasoning.

Tip: You can give me more context with [.greptile/rules](https://www.greptile.com/docs/code-review/custom-standards) so I get this right next time.

@akshaydeo akshaydeo changed the title [DO NOT MERGE] fix: prevent send on closed channel panic in provider queue shutdown fix: prevent send on closed channel panic in provider queue shutdown --skip-pipeline Apr 17, 2026
Contributor

akshaydeo commented Apr 17, 2026

Merge activity

  • Apr 17, 7:53 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 17, 7:53 AM UTC: @akshaydeo merged this pull request with Graphite.

@akshaydeo akshaydeo merged commit 18427ac into main Apr 17, 2026
17 of 19 checks passed
@akshaydeo akshaydeo deleted the 04-15-fix_send_to_closed_channels_edge_case_race_condition_on_provider_reload branch April 17, 2026 07:53
dwjwlxs added a commit to dwjwlxs/bifrost that referenced this pull request Apr 20, 2026
* fix: delete fallbacks from anthropic req (#2754)

## Summary

Remove the `fallbacks` field from Anthropic provider request bodies to ensure compatibility with the Anthropic API specification.

## Changes

- Added logic to delete the `fallbacks` field from JSON request bodies in the Anthropic provider's `getRequestBodyForResponses` function
- Implemented proper error handling for the field deletion operation with appropriate Bifrost error wrapping

## Type of change

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Test Anthropic provider requests to ensure the `fallbacks` field is properly removed and requests succeed:

```sh
# Core/Transports
go version
go test ./...

# Test specific Anthropic provider functionality
go test ./core/providers/anthropic/...
```

Verify that requests to the Anthropic API no longer include the `fallbacks` field and complete successfully.

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

No security implications - this change only removes an unsupported field from API requests.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* fix: preserve context values in async requests (#2703)

## Summary

Refactors async job execution to pass the full BifrostContext instead of just the virtual key value, enabling proper context preservation for background operations including virtual keys, tracing headers, and other request metadata.

## Changes

- Modified `AsyncJobExecutor.SubmitJob()` to accept `*schemas.BifrostContext` instead of `*string` for virtual key
- Updated `executeJob()` to restore all original request context values in the background goroutine
- Added `getVirtualKeyFromContext()` helper function to extract virtual key from BifrostContext
- Updated all async handler methods to pass BifrostContext directly to `SubmitJob()`
- Removed redundant virtual key extraction logic from HTTP handlers

## Type of change

- [x] Refactor

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)

## How to test

Verify async job execution preserves request context properly:

```sh
# Core/Transports
go version
go test ./...

# Test async endpoints with virtual keys and tracing headers
curl -X POST http://localhost:8080/v1/async/chat/completions \
  -H "Authorization: Bearer vk_test_key" \
  -H "X-Trace-Id: test-trace-123" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'

# Verify job execution maintains context
curl http://localhost:8080/v1/async/jobs/{job_id}
```

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

Improves security by ensuring proper context isolation and virtual key handling in async operations.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* [fix]: Gemini provider - handle content block tool outputs in Responses API path (#2692)

When function_call_output messages arrive via the Anthropic Responses API
format, their output is an array of content blocks (ResponsesFunctionToolCallOutputBlocks),
not a plain string (ResponsesToolCallOutputStr). The Gemini provider's
convertResponsesMessagesToGeminiContents only checked the string case,
silently dropping all tool result content and sending empty {} responses
to Gemini. This caused the model to loop endlessly retrying tool calls
it never saw results for.

Other providers (Bedrock, OpenAI, Cohere) already handle both output
formats. This aligns the Gemini provider with them.

Affected packages:
- core/providers/gemini/responses.go - Add ResponsesFunctionToolCallOutputBlocks handling
- core/providers/gemini/gemini_test.go - Add test for content block outputs

Co-authored-by: tom <tom@asteroid.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Akshay Deo <akshay@akshaydeo.com>

* fix: gemini thinking level and finish reason round-trip preservation (#2697)

## Summary

Fixes two critical regressions in the Gemini provider's GenAI integration: preserves `thinkingLevel` parameters during round-trip conversions and ensures `MAX_TOKENS` finish reasons survive Bifrost transformations.

## Changes

- **Fixed thinking level preservation**: Modified `convertGenerationConfigToResponsesParameters()` to only set effort from `thinkingLevel` without deriving a `thinkingBudget`, preventing unwanted behavior changes in Gemini 3.x models
- **Enhanced finish reason handling**: Added bidirectional conversion between Gemini and Bifrost finish reasons, prioritizing `StopReason` over `IncompleteDetails` to preserve `MAX_TOKENS` finish reasons
- **Expanded finish reason support**: Added new Gemini finish reason constants for image generation, tool calls, and malformed responses
- **Improved response conversion**: Updated response conversion logic to properly handle error finish reasons and set appropriate status/error fields

## Type of change

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Validate the thinking level and finish reason preservation:

```sh
# Run Gemini provider tests
go test ./core/providers/gemini/... -v

# Specifically test the regression fixes
go test ./core/providers/gemini/... -run "TestGenAIThinkingLevel_RoundTripPreservesLevelNotBudget|TestGenAIFinishReasonMaxTokens_PersistsThroughBifrostRoundTrip" -v
```

Test with actual Gemini API calls using thinking levels and verify that:
- `thinkingLevel` parameters are preserved without generating unwanted `thinkingBudget` values
- Responses with `MAX_TOKENS` finish reason maintain that status through the conversion pipeline

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

Addresses regressions in GenAI path where thinking configuration and finish reasons were being incorrectly transformed during Bifrost conversions.

## Security considerations

No security implications - this change only affects internal data structure conversions and doesn't modify authentication, secrets handling, or data exposure.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* fix: remove cc user agent guard from streaming in anthropic (#2706)

## Summary

Fixes WebSearch tool argument handling for all clients by removing the Claude Code user agent restriction. Previously, only Claude Code clients received proper WebSearch query arguments in the streaming response, while other clients lost the query data due to skipped argument deltas.

## Changes

- Removed the `IsClaudeCodeRequest(ctx)` check that was restricting WebSearch argument sanitization and synthetic delta generation to only Claude Code clients
- WebSearch tool arguments are now sanitized and synthetic `input_json_delta` events are generated for all clients during `output_item.done` events
- Added comprehensive test coverage for the WebSearch tool flow including argument delta skipping, synthetic delta generation, and full end-to-end streaming scenarios
- Enhanced code comments to clarify the WebSearch tool handling logic

## Type of change

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Validate the WebSearch tool behavior with the new test suite:

```sh
# Run the new WebSearch tests
go test ./core/providers/anthropic -run TestWebSearch -v

# Run all provider tests to ensure no regressions
go test ./core/providers/anthropic/...

# Full test suite
go test ./...
```

Test with different user agents to verify WebSearch queries are properly streamed to all clients, not just Claude Code.

## Screenshots/Recordings

N/A - This is a backend streaming API fix.

## Breaking changes

- [ ] Yes
- [x] No

This change expands functionality to previously broken clients without affecting existing working behavior.

## Related issues

Fixes WebSearch tool argument streaming for non-Claude Code clients.

## Security considerations

The change maintains existing argument sanitization for WebSearch tools while expanding it to all clients, preserving the same security posture.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* remove unnecessary marshalling of payload (#2770)

## Summary

Optimized JSON parsing in the Anthropic integration by replacing full JSON unmarshaling with targeted extraction of the "type" field from streaming responses via gjson.

## Changes

- Replaced `sonic.Unmarshal()` with `gjson.Get()` to extract only the "type" field from Anthropic stream events
- Eliminated the need to unmarshal the entire JSON response into an `AnthropicStreamEvent` struct
- Improved performance by avoiding unnecessary JSON parsing overhead

## Type of change

- [x] Refactor
- [ ] Bug fix
- [ ] Feature
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Test streaming responses from the Anthropic integration to ensure the type field is correctly extracted:

```sh
# Core/Transports
go version
go test ./...

# Test specifically the Anthropic integration
go test ./transports/bifrost-http/integrations/
```

## Screenshots/Recordings

N/A

## Breaking changes

- [x] No
- [ ] Yes

## Related issues

N/A

## Security considerations

No security implications - this is a performance optimization that maintains the same functionality.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* feat: claude opus 4.7 compatibility (#2773)

## Summary

Adds support for Claude Opus 4.7 model with specific parameter handling and reasoning configuration changes. Opus 4.7 rejects temperature, top_p, and top_k parameters and only supports adaptive thinking mode without budget tokens.

## Changes

- Added `IsOpus47()` function to detect Claude Opus 4.7 models
- Modified parameter handling to skip temperature, top_p, and top_k for Opus 4.7 models
- Updated reasoning configuration to use adaptive thinking only for Opus 4.7 (no budget_tokens)
- Added support for `display` parameter in thinking configuration to control output visibility
- Extended adaptive thinking support to include Sonnet 4.6 models
- Added task budget support with new beta header `task-budgets-2026-03-13`
- Updated effort mapping to handle Opus 4.7's "xhigh" effort level
- Added comprehensive test coverage for Opus 4.7 specific behaviors
- Fixed OpenAI responses to filter out Anthropic-specific summary:"none" parameter

## Type of change

- [x] Feature
- [x] Bug fix

## Affected areas

- [x] Core (Go)
- [x] Providers/Integrations

## How to test

Validate the changes with the following tests:

```sh
# Core/Transports
go version
go test ./core/providers/anthropic/...

# Specific test cases for Opus 4.7
go test -run TestToAnthropicChatRequest_Opus47 ./core/providers/anthropic/
go test -run TestSupportsAdaptiveThinking ./core/providers/anthropic/
go test -run TestAddMissingBetaHeadersToContext_TaskBudgets ./core/providers/anthropic/
```

Test with Claude Opus 4.7 model requests to ensure:
- Temperature, top_p, top_k parameters are stripped
- Reasoning uses adaptive thinking without budget_tokens
- Task budget beta headers are properly added

## Breaking changes

- [ ] Yes
- [x] No

The changes maintain backward compatibility while adding new model support.

## Security considerations

No security implications. Changes only affect parameter handling and model-specific configurations for Anthropic's Claude models.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* docs: restructure helm guide into comprehensive multi-page reference (#2776)

* docs: restructure helm guide into comprehensive multi-page reference (#2771)

## Summary

Restructures the Helm deployment documentation into a comprehensive multi-page guide with dedicated sections for each configuration area. The main Helm page now provides quickstart instructions for both OSS and Enterprise deployments, while detailed configuration is split into focused sub-pages.

## Changes

- **Restructured main Helm page**: Condensed from 740+ lines to 103 lines with clear quickstart tabs for OSS vs Enterprise
- **Added 8 new dedicated configuration pages**:
  - `values.mdx` - Complete values reference with examples and common patterns
  - `client.mdx` - Client configuration (pool size, logging, CORS, auth, compat shims)
  - `providers.mdx` - Provider setup for all 23+ supported LLM providers with cloud-native auth
  - `storage.mdx` - Storage backends (SQLite, PostgreSQL, object storage, vector stores)
  - `plugins.mdx` - Plugin configuration (telemetry, logging, semantic cache, OTel, Datadog)
  - `governance.mdx` - Governance setup (budgets, rate limits, virtual keys, routing rules)
  - `cluster.mdx` - Multi-replica HA with gossip-based peer discovery
  - `troubleshooting.mdx` - Common issues and diagnostic commands
- **Updated chart version**: Bumped from 1.5.0 to 2.1.0
- **Enhanced navigation**: Added nested Helm section in docs.json with proper icons and organization

## Type of change

- [x] Documentation

## Affected areas

- [x] Docs

## How to test

Navigate through the new Helm documentation structure:

1. Visit the main Helm page for quickstart instructions
2. Follow the quickstart for either OSS or Enterprise deployment
3. Use the sub-pages for detailed configuration of specific areas
4. Verify all internal links work correctly
5. Test the troubleshooting commands on a real deployment

The documentation now provides both quick-start paths and comprehensive reference material for production deployments.

## Screenshots/Recordings

N/A - Documentation changes only

## Breaking changes

- [ ] Yes
- [x] No

This is purely a documentation restructure with no functional changes to the Helm chart itself.

## Related issues

Improves Helm documentation organization and usability for both new users and production deployments.

## Security considerations

The new documentation emphasizes security best practices:
- Kubernetes Secrets for all sensitive values
- Cloud-native authentication (IRSA, Workload Identity, Managed Identity)
- Proper RBAC setup for cluster mode
- Compliance considerations (HIPAA, PCI) for content logging

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* docs: update guardrails docs

* v1.4.23 cut (#2778)

## Summary

Releases versions 1.4.20 through 1.4.23 with bug fixes for provider integrations, streaming error handling, and migration test improvements. This release addresses critical issues in Gemini, Bedrock, and Anthropic providers while adding support for Claude Opus 4.7.

## Changes

- **Provider Fixes**: Fixed Gemini tool outputs handling, Bedrock streaming events, and image content preservation in tool results
- **Streaming Improvements**: Added proper error capture for Responses streaming API to prevent silent failures
- **Migration Tests**: Added support for v1.4.22 governance model pricing flex tier columns in both PostgreSQL and SQLite migration tests
- **Anthropic Enhancements**: Removed fallback fields from outgoing requests and added Claude Opus 4.7 compatibility
- **Framework Fixes**: Improved async context propagation and custom provider model validation
- **Plugin Updates**: Enhanced OTEL metrics and configuration defaults

## Type of change

- [x] Bug fix
- [x] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [x] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Validate the migration test changes and provider fixes:

```sh
# Test migration scripts
./.github/workflows/scripts/run-migration-tests.sh

# Core/Transports
go version
go test ./...

# Test provider integrations
go test ./transports/...
go test ./plugins/...
```

Test the new governance model pricing columns are properly handled in migration scenarios.

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

Addresses multiple provider integration issues and streaming API error handling improvements.

## Security considerations

No security implications - changes are focused on bug fixes and migration test improvements.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* validator fix (#2780)

## Summary

Enhanced GitHub Actions security by transitioning from audit-only to strict network egress control using step-security/harden-runner. This change blocks all outbound network traffic by default and explicitly allows only required endpoints for each workflow.

## Changes

- Changed `egress-policy` from `audit` to `block` across all GitHub Actions workflows
- Added comprehensive `allowed-endpoints` lists for each job, specifying only the necessary external services
- Updated step names from "Harden the runner (Audit all outbound calls)" to "Harden Runner" for consistency
- Fixed schema validation script to use correct JSON paths for concurrency and SCIM configuration validation
- Reformatted JSON schema file for improved readability (whitespace and formatting changes only)
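
In block mode, each job's harden-runner step follows this shape (the endpoints listed here are examples, not the actual per-job allowlists from this PR):

```yaml
- name: Harden Runner
  uses: step-security/harden-runner@v2
  with:
    egress-policy: block        # was: audit
    allowed-endpoints: >
      api.github.com:443
      proxy.golang.org:443
      sum.golang.org:443
```

Any outbound call to a host not on the list is refused, which is what surfaces undeclared network dependencies.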

## Type of change

- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [x] Plugins
- [x] UI (Next.js)
- [x] Docs

## How to test

Verify that all GitHub Actions workflows continue to function properly with the new network restrictions:

```sh
# Trigger workflows by pushing to a branch or creating a PR
git push origin feature-branch

# Monitor workflow runs in GitHub Actions tab to ensure:
# - All jobs complete successfully
# - No network connectivity errors occur
# - All required external services remain accessible
```

Key endpoints that should remain accessible include:
- GitHub API and release assets
- Package registries (npm, PyPI, Go modules)
- Docker registries
- Cloud storage services
- External APIs used by tests and integrations

## Screenshots/Recordings

N/A - Infrastructure/CI changes only

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

This change significantly improves security posture by:
- Preventing unauthorized outbound network connections from CI runners
- Creating an explicit allowlist of required external services
- Reducing attack surface for supply chain attacks
- Providing better visibility into network dependencies

The transition from audit to block mode ensures that any new network dependencies must be explicitly approved and documented.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* fix: token usage for vllm --skip-pipeline (#2784)

## Summary

Fixed token usage attribution for vLLM by treating empty-string content the same as nil content in streaming responses. vLLM sends `delta.content=""` (instead of `delta: null`) in finish_reason chunks, which was being forwarded and causing the synthesis chunk to lose its finish_reason, breaking usage attribution in logs and UI.

## Changes

- Modified streaming content handling to check for both nil and empty string content before processing chunks
- This prevents empty content deltas from being forwarded, ensuring finish_reason is preserved for proper token usage tracking
- Removed extraneous whitespace and formatting inconsistencies throughout the OpenAI provider code

## Type of change

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [x] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Test with vLLM provider to ensure token usage is properly attributed:

```sh
# Core/Transports
go version
go test ./...

# Test streaming chat completion with vLLM
# Verify that finish_reason is preserved in final chunks
# Check that token usage appears correctly in logs/UI
```

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

Fixes token usage tracking issues with vLLM provider.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* [fix]: OpenAI provider - flatten array-form tool_result output for Responses API (#2781) --skip-pipeline

When Anthropic tool_result blocks arrive with array-form content (the
standard shape for multi-turn tool exchanges), the OpenAI provider's
MarshalJSON emitted the output as a JSON array on the wire. The OpenAI
Responses API defines function_call_output.output as a string — strict
upstreams (Ollama Cloud, openai-go typed models) reject the array form
with HTTP 400.

Fix: before marshaling, collapse text-only
ResponsesFunctionToolCallOutputBlocks into a newline-joined string.
Non-text blocks (images, files) are left as-is. The schema type is
unchanged; the transformation lives in the OpenAI provider's outbound
marshaler only.

Closes #2779

Affected packages:
- core/providers/openai/types.go - Flatten text-only output blocks to string
- core/providers/openai/responses_marshal_test.go - Three regression tests
- core/changelog.md - Changelog entry

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent send on closed channel panic in provider queue shutdown --skip-pipeline (#2725)

## Summary

Fixes a race condition in provider queue shutdown that caused "send on closed channel" panics in production. The issue occurred when producers passed the `isClosing()` check but then attempted to send to a queue that was closed before they reached the select statement.

## Changes

- **Removed queue channel closure**: Queue channels are never closed to prevent "send on closed channel" panics
- **Updated worker exit mechanism**: Workers now exit via the `done` channel signal instead of waiting for queue closure
- **Enhanced shutdown handling**: Workers drain remaining buffered requests and send shutdown errors when `done` is signaled
- **Added producer re-routing**: Stale producers can transparently re-route to new queues during `UpdateProvider`
- **Improved error handling**: Added rollback logic for failed provider updates with proper cleanup
- **Enhanced transfer logic**: Buffered requests are transferred before signaling shutdown to ensure they reach new workers
- **Added comprehensive tests**: Race condition demonstration and validation of the fix across multiple scenarios

## Type of change

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [x] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Run the new race condition test to verify the fix:

```sh
go test -run TestProviderQueue_SendOnClosedChannel_Race ./core -v
```

Run the comprehensive provider lifecycle tests:

```sh
go test -run TestProviderQueue ./core -v
go test -run TestUpdateProvider ./core -v
go test -run TestRemoveProvider ./core -v
```

Run the full test suite to ensure no regressions:

```sh
go test ./...
```

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

Fixes production panics related to concurrent provider queue operations during shutdown/updates.

## Security considerations

None - this is an internal concurrency fix that doesn't affect external interfaces or data handling.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* feat: preserve MCP tool annotations in bidirectional conversion --skip-pipeline (#2746)

## Summary

Adds support for preserving MCP tool annotations when converting between MCP tools and Bifrost schemas. This enables MCP servers to provide behavioral hints (read-only, destructive, idempotent, open-world) that help agents make better reasoning decisions about tool usage.

## Changes

- Added `MCPToolAnnotations` struct to capture MCP spec hints including title, read-only, destructive, idempotent, and open-world indicators
- Modified `convertMCPToolToBifrostSchema` to preserve MCP tool annotations when converting from MCP tools to Bifrost chat tools
- Updated `ChatToolFunction` to include optional annotations field
- Enhanced MCP server sync logic to map Bifrost annotations back to MCP tool annotations for bidirectional compatibility

## Type of change

- [x] Feature
- [ ] Bug fix
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Test with an MCP server that provides tool annotations to verify they are preserved through the conversion process:

```sh
# Core/Transports
go version
go test ./...

# UI
cd ui
pnpm i || npm i
pnpm test || npm test
pnpm build || npm run build
```

Verify that MCP tools with annotations maintain their behavioral hints when converted to Bifrost schemas and back.

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

No security implications - this change only preserves metadata hints that help with tool behavior classification.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* fix: add support for Anthropic structured output and response format (#1972)

* fix: add support for Anthropic structured output and response format conversion

* fix: refactor output configuration setting in ToBedrockResponsesRequest

* run go fmt on responses.go

* fix: streamline response format conversion for Anthropic models

* fix: enhance merging of additional model request fields and output configuration

* fix: remove koanf/maps dependency and replace its usage with internal merge function

* preserve order in output_config

* update type casting

* add non-anthropic test-case

* check for output_config first

* diversify anthropic output formats

* move bifrost ctx update

* guard tested field

* guard format.jsonschema

* test fixes --skip-pipeline (#2782)

## Summary

Updates test configurations to align with current API specifications and replaces deprecated utility function usage.

## Changes

- Replaced `schemas.Ptr("test")` with `new("test")` in Anthropic chat test for string pointer creation
- Updated MCP client configuration tests to use `sse` connection type instead of `websocket` with simplified `connection_string` field
- Modified HTTP MCP client config to use `connection_string` instead of nested `http_config` object
- Changed OpenTelemetry plugin tests to use `genai_extension` trace type instead of `otel`

## Type of change

- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations

## How to test

Validate that all tests pass with the updated configurations:

```sh
# Core/Transports
go version
go test ./...
```

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

No security implications - these are test configuration updates only.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* anthropic container changes --skip-pipeline (#2783)

## Summary

Briefly explain the purpose of this PR and the problem it solves.

## Changes

- What was changed and why
- Any notable design decisions or trade-offs

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Describe the steps to validate this change. Include commands and expected outcomes.

```sh
# Core/Transports
go version
go test ./...

# UI
cd ui
pnpm i || npm i
pnpm test || npm test
pnpm build || npm run build
```

If adding new configs or environment variables, document them here.

## Screenshots/Recordings

If UI changes, add before/after screenshots or short clips.

## Breaking changes

- [ ] Yes
- [ ] No

If yes, describe impact and migration instructions.

## Related issues

Link related issues and discussions. Example: Closes #123

## Security considerations

Note any security implications (auth, secrets, PII, sandboxing, etc.).

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* core schema changes --skip-pipeline (#2787)

## Summary

Promotes Anthropic-native parameters to the neutral ChatParameters layer, enabling direct access to advanced Anthropic features like containers, MCP servers, task budgets, and enhanced tool configurations without requiring ExtraParams.

## Changes

- Added neutral fields to `ChatParameters` for Anthropic-specific features: `TopK`, `Speed`, `InferenceGeo`, `MCPServers`, `Container`, `CacheControl`, `TaskBudget`, and `ContextManagement`
- Enhanced `ChatTool` with Anthropic tool flags: `DeferLoading`, `AllowedCallers`, `InputExamples`, and `EagerInputStreaming`
- Added `Display` field to `ChatReasoning` for Anthropic adaptive thinking control
- Implemented `StripUnsupportedAnthropicFields` function to remove unsupported features based on provider capabilities
- Updated parameter mapping logic to prefer neutral fields over ExtraParams with fallback support
- Added comprehensive JSON marshaling/unmarshaling for union types like `ChatContainer`

The design maintains backward compatibility by falling back to ExtraParams when neutral fields are not set, while providing type-safe access to advanced Anthropic features.

## Type of change

- [x] Feature
- [ ] Bug fix
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [x] Core (Go)
- [ ] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Validate the new parameter handling and provider feature gating:

```sh
# Core/Transports
go version
go test ./...

# Test Anthropic provider parameter mapping
go test ./core/providers/anthropic/...

# Verify schema validation
go test ./core/schemas/...
```

Test with requests containing the new neutral fields to ensure proper mapping to Anthropic API format and appropriate stripping for unsupported providers.

## Screenshots/Recordings

N/A - Backend API changes only.

## Breaking changes

- [ ] Yes
- [x] No

This change is fully backward compatible. Existing ExtraParams usage continues to work, while new neutral fields provide enhanced type safety.

## Related issues

N/A

## Security considerations

The new MCP server configuration includes authorization tokens. Ensure proper handling of sensitive credentials in the `ChatMCPServer.AuthorizationToken` field.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* dependabot fixes --skip-pipeline (#2788)

## Summary

This PR adds the Hono web framework as a direct dependency to all MCP server examples and updates various dependencies across the project to their latest versions.

## Changes

- Added `hono@^4.12.14` as a direct dependency to all MCP server examples (edge-case-server, error-test-server, parallel-test-server, temperature, test-tools-server)
- Upgraded Hono from version 4.11.4 to 4.12.14 and changed it from a peer dependency to a direct dependency
- Updated Python dependencies including authlib (1.6.6 → 1.6.11), langchain-core (1.2.28 → 1.2.31), langchain-openai (1.1.4 → 1.1.14), langchain-text-splitters (1.1.0 → 1.1.2), langsmith (0.5.0 → 0.7.32), openai (2.13.0 → 2.32.0), and python-multipart (0.0.20 → 0.0.26)
- Updated TypeScript dependencies including langsmith (0.5.18 → 0.5.19) and added it as a direct dependency
- Added `github.com/tidwall/gjson v1.18.0` as a direct dependency in Go transports module
- Updated UI dependencies including dompurify (3.3.3 → 3.4.0) and follow-redirects (1.15.11 → 1.16.0) via package overrides

## Type of change

- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [x] UI (Next.js)

## How to test

Validate that all dependencies are properly installed and examples still function:

```sh
# Test MCP server examples
cd examples/mcps/temperature
npm install
npm run build

# Test Go transports
cd transports
go mod tidy
go test ./...

# Test Python integrations
cd tests/integrations/python
uv sync
uv run python -m pytest

# Test TypeScript integrations
cd tests/integrations/typescript
npm install
npm test

# Test UI
cd ui
pnpm install
pnpm test
pnpm build
```

## Screenshots/Recordings

N/A - dependency updates only

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

The dependency updates include security patches, particularly for dompurify and follow-redirects which are explicitly overridden in the UI package.json for security reasons.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* move back go to 1.26.1 (#2792)

## Summary

Downgrade Go version from 1.26.2 to 1.26.1 across all GitHub Actions workflows, Go modules, and Docker images to address compatibility issues.

## Changes

- Downgraded Go version from 1.26.2 to 1.26.1 in all GitHub Actions workflows (e2e-tests, pr-tests, release-cli, release-pipeline, snyk)
- Updated go.mod files for core, CLI, examples, and test modules to use Go 1.26.1
- Updated Docker base images in transports/Dockerfile and transports/Dockerfile.local to use golang:1.26.1-alpine3.23
- Added stream cancellation safety improvements with guarded channel sends and finalizer protection to prevent goroutine leaks when clients disconnect
- Enhanced stream error checking with context cancellation support to properly drain upstream channels

## Type of change

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Validate the Go version downgrade and streaming improvements:

```sh
# Verify Go version
go version

# Core/Transports
go test ./...

# Test streaming endpoints with client disconnection scenarios
# to verify proper cleanup and no goroutine leaks
```

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

The streaming improvements enhance resource cleanup and prevent potential goroutine leaks when clients disconnect unexpectedly, improving overall system stability.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [x] I added/updated tests where appropriate
- [x] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* temp gotoolchain auto (#2809)

* temp hack for tests (#2810)

## Summary

The Go workspace setup script was not specifying a `go` directive or toolchain version, which caused `GOTOOLCHAIN=auto` to select a Go version lower than what `core@v1.4.19` requires. This adds explicit `go 1.26.2` and `toolchain go1.26.2` directives to the workspace so the correct toolchain is used automatically.

## Changes

- Added `go work edit -go=1.26.2 -toolchain=go1.26.2` to `setup-go-workspace.sh` so that `GOTOOLCHAIN=auto` selects Go >= 1.26.2, satisfying the minimum version required by the published `core@v1.4.19` module referenced in `transports/go.mod`.

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [ ] Core (Go)
- [x] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

```sh
# Verify the workspace is initialized with the correct Go version
bash .github/workflows/scripts/setup-go-workspace.sh
grep -E "^go |^toolchain" go.work
# Expected output:
# go 1.26.2
# toolchain go1.26.2

go test ./...
```

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

## Security considerations

None.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* temp block docker build (#2811)

## Summary

Temporarily disables the `test-docker-image-amd64` and `test-docker-image-arm64` CI jobs in the release pipeline by commenting them out.

## Changes

- Both Docker image test jobs (`test-docker-image-amd64` and `test-docker-image-arm64`) have been commented out rather than removed, preserving the full job definitions for easy re-enablement later.

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

No functional code changes. Verify the release pipeline runs without executing the Docker image test jobs.

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

No security implications.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* removed docker build steps (#2812)

## Summary

The `test-docker-image-amd64` and `test-docker-image-arm64` CI jobs have been removed from the release pipeline. These jobs were already commented out and non-functional, and all references to them as dependencies and gate conditions in downstream release jobs have been cleaned up.

## Changes

- Deleted the commented-out `test-docker-image-amd64` and `test-docker-image-arm64` job definitions from the release pipeline.
- Removed `test-docker-image-amd64` and `test-docker-image-arm64` from the `needs` arrays of `core-release`, `framework-release`, `plugins-release`, `bifrost-http-release`, the Docker build/push jobs, the manifest job, and the final notification job.
- Removed the corresponding result checks for those two jobs from all `if` conditions in the affected release jobs.
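
For illustration, the effect on a downstream release job looks like this (job and output names are taken from this PR's description; the exact workflow structure may differ):

```yaml
core-release:
  # before: needs: [detect-changes, test-docker-image-amd64, test-docker-image-arm64]
  needs: [detect-changes]
  # result checks for the removed jobs are likewise dropped from `if` conditions
```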

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Trigger the release pipeline and confirm that all release jobs proceed without waiting on or referencing the removed Docker image test jobs.

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

## Security considerations

None.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* moves tests to 1.26.2 and 1.26.1 (#2813)

## Summary

Bumps the Go version used across all release pipeline jobs from `1.26.1` to `1.26.2` to keep the CI environment on the latest patch release.

## Changes

- Updated Go version from `1.26.1` to `1.26.2` in all `setup-go` steps within the release pipeline workflow.

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

The release pipeline will use the updated Go version on the next run. No additional manual steps are required beyond verifying the CI pipeline passes.

```sh
go version
go test ./...
```

## Screenshots/Recordings

N/A

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

Patch releases often include security and bug fixes. Staying on the latest patch version reduces exposure to known vulnerabilities in the Go toolchain.

## Checklist

- [x] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [x] I verified builds succeed (Go and UI)
- [x] I verified the CI pipeline passes locally if applicable

* ocr test fixes (#2814)

## Summary

Adds an operation-allowed check for OCR requests before they are dispatched to a provider, and fixes the Mistral provider to return its custom provider name when one is configured.

## Changes

- Added a `CheckOperationAllowed` guard for `OCRRequest` in `handleProviderRequest`, consistent with how other request types are gated. If the operation is not permitted, a `BifrostError` is returned with the provider key, request type, and requested model populated.
- Updated `MistralProvider.GetProviderKey()` to use `providerUtils.GetProviderName` so that custom provider configurations are respected, rather than always returning the hardcoded `schemas.Mistral` value.
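
The guard pattern can be sketched as follows (type and field names here are hypothetical stand-ins for the real `schemas.BifrostError` and `handleProviderRequest` internals):

```go
package main

import "fmt"

// BifrostError is a hypothetical stand-in for schemas.BifrostError,
// carrying the fields the PR says are populated on a denied operation.
type BifrostError struct {
	Provider       string
	RequestType    string
	ModelRequested string
}

func (e *BifrostError) Error() string {
	return fmt.Sprintf("%s not allowed on provider %s (model %s)",
		e.RequestType, e.Provider, e.ModelRequested)
}

// checkOperationAllowed mirrors the gate added in handleProviderRequest:
// reject the request up front when the operation is not permitted,
// instead of letting the provider call fail later.
func checkOperationAllowed(allowed map[string]bool, provider, reqType, model string) *BifrostError {
	if !allowed[reqType] {
		return &BifrostError{Provider: provider, RequestType: reqType, ModelRequested: model}
	}
	return nil
}

func main() {
	allowed := map[string]bool{"chat_completion": true} // OCR not enabled
	if err := checkOperationAllowed(allowed, "mistral", "ocr", "mistral-ocr-latest"); err != nil {
		fmt.Println(err) // ocr not allowed on provider mistral (model mistral-ocr-latest)
	}
}
```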

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [x] Core (Go)
- [ ] Transports (HTTP)
- [x] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

```sh
go version
go test ./...
```

- Configure a custom provider wrapping Mistral and verify that `GetProviderKey()` returns the custom provider name rather than `mistral`.
- Attempt an OCR request against a provider where the operation is not allowed and confirm a `BifrostError` is returned with the correct `Provider`, `RequestType`, and `ModelRequested` fields set.
- Attempt an OCR request against a provider where the operation is allowed and confirm the request proceeds normally.

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

## Security considerations

None.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* revert to old schema (#2815)

## Summary

This PR simplifies and consolidates the `config.schema.json` by removing several features, collapsing provider-specific schema variants, and restructuring key configuration definitions to reduce complexity and align with updated runtime semantics.

## Changes

- Removed the top-level `version` field that controlled allow-list semantics for empty arrays
- Removed the `compat` plugin configuration block (including `convert_text_to_chat`, `convert_chat_to_responses`, `should_drop_params`, `should_convert_params`)
- Replaced `compat` with a simpler `enable_litellm_fallbacks` boolean for Groq text completion fallbacks
- Removed `mcp_disable_auto_tool_inject` and `routing_chain_max_depth` from server config
- Collapsed `provider_with_ollama_config`, `provider_with_sgl_config`, and `provider_with_replicate_config` into the generic `provider` definition; removed their corresponding key types (`ollama_key`, `sgl_key`, `replicate_key`) and `network_config_without_base_url`
- Removed providers `nebius`, `xai`, and `runway` from the providers block
- Moved `calendar_aligned` from `virtual_key` to the `budget` object; removed `virtual_key_id` and `provider_config_id` from budget in favor of a standalone `budget_id` reference on virtual keys
- Removed `chain_rule` from routing rules and relaxed the `scope_id` conditional requirement
- Simplified `virtual_key_provider_config` to inline key definitions with full provider-specific key configs (Azure, Vertex, Bedrock, VLLM), replacing the separate `key_ids` and `keys` split
- Removed `mcp_client_name` and `allow_on_all_virtual_keys` from MCP configs; removed `allowed_extra_headers` and `disable_auto_tool_inject` from MCP client config
- Added `websocket` as a supported MCP connection type with a dedicated `websocket_config` block; removed `inprocess` connection type
- Removed `per_user_oauth` as an MCP auth type and dropped the conditional `oauth_config_id` requirement
- Renamed `concurrency_and_buffer_size` to `concurrency_config`; renamed `retry_backoff_initial`/`retry_backoff_max` to `retry_backoff_initial_ms`/`retry_backoff_max_ms`; removed `enforce_http2` and `openai_config` from network config
- Moved `pricing_overrides` from the top-level config into individual provider definitions
- Simplified `provider_pricing_override` schema, removing scoped fields (`scope_kind`, `virtual_key_id`, `provider_id`, `provider_key_id`) and replacing `pattern` with `model_pattern`; added `regex` as a valid `match_type`; expanded supported `request_types`
- Renamed `scim_config` to `saml_config` in the top-level schema
- Removed `apiToken` from Okta config and made `clientSecret` optional; updated required fields to only `issuerUrl` and `clientId`
- Removed `object_storage` and `retention_days` from the logs store config
- Removed `id` and `description` fields from provider config entries in the `provider_configs` array
- Removed `websocket_responses` and `realtime` from `custom_provider_config` allowed requests; removed the enum constraint on `base_provider_type`
- Removed `disable_auto_tool_inject` from `mcp_client_config` VFS settings
- Added `deployments` mapping to `azure_key_config` and `vertex_key_config`
- Updated `otel` plugin `trace_type` to only accept `"otel"` (removed `genai_extension`, `vercel`, `open_inference`)
- Removed `prompts` from the built-in plugin name list
- Removed `builtin` as a valid plugin `placement` value
- Changed `cluster_config` discovery `dial_timeout` from a Go duration string to an integer (nanoseconds)
- Reformatted many inline `required` arrays to multi-line style for readability
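
For example, a discovery dial timeout of five seconds is now written as an integer nanosecond count rather than `"5s"` (the nesting shown here is illustrative):

```json
{
  "cluster_config": {
    "discovery": {
      "dial_timeout": 5000000000
    }
  }
}
```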

## Type of change

- [ ] Bug fix
- [ ] Feature
- [x] Refactor
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [x] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

Validate existing configs against the updated schema to confirm they parse correctly. Verify that configs using removed fields (`version`, `compat`, `mcp_disable_auto_tool_inject`, `chain_rule`, etc.) are rejected by the schema validator.

```sh
go test ./...
```

Confirm that provider configs for Ollama, SGL, and Replicate continue to work using the generic `provider` definition. Confirm MCP clients using `websocket` connection type validate correctly with a `websocket_config` block.

## Breaking changes

- [x] Yes
- [ ] No

The following fields have been removed and configs using them will fail schema validation:

- `version` (top-level)
- `compat` block under server config
- `mcp_disable_auto_tool_inject` and `routing_chain_max_depth` under server config
- `chain_rule` on routing rules
- `calendar_aligned` on virtual keys (now on budgets)
- `virtual_key_id` / `provider_config_id` on budgets
- `apiToken` on Okta config (now optional `clientSecret` only)
- `object_storage` and `retention_days` on logs store
- `id`, `description` on provider config entries
- `allow_on_all_virtual_keys`, `allowed_extra_headers`, `disable_auto_tool_inject` on MCP client config
- `inprocess` MCP connection type and `per_user_oauth` auth type
- `enforce_http2` and `openai_config` from network config
- `builtin` plugin placement value; `prompts` built-in plugin name
- `nebius`, `xai`, `runway` provider entries

Migrate by removing or replacing these fields according to the updated schema definitions.

## Related issues

## Security considerations

Removal of `per_user_oauth` as an MCP auth type should be reviewed to ensure no active integrations depend on it. The relaxed `scope_id` requirement on routing rules should be validated to confirm it does not inadvertently broaden access scope.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* reduced release pipeline for this cut for go downgrade (#2816)

## Summary

This PR removes all test jobs from the release pipeline and decouples them from the release gate conditions, allowing releases to proceed without waiting for (often flaky) provider API test results. It also significantly expands and restructures `config.schema.json` to reflect new features, provider support, and breaking semantic changes introduced in v1.5.0.

## Changes

- **Release pipeline**: Removed `test-core`, `approve-flaky-test-core`, `test-framework`, `test-plugins`, `test-bifrost-http`, `test-migrations`, and `test-e2e-ui` jobs entirely from `release-pipeline.yml`. All release jobs (`core-release`, `framework-release`, `plugins-release`, `bifrost-http-prep`, `docker-build-amd64`, `docker-build-arm64`, `push-mintlify-changelog`) now depend only on change detection and upstream release jobs, not on test outcomes.

- **Schema: deny-by-default semantics (v1.5.0)**: Empty arrays in `provider_configs`, `mcp_configs`, `allowed_models`, `key_ids`, and `tools_to_execute` now mean "deny all" rather than "allow all". Use `["*"]` to allow all. A top-level `version` field (enum `1` or `2`, default `2`) controls which semantic applies, with `1` restoring v1.4.x behavior.
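
  The allow-list matching can be sketched as (illustrative only; the real validation lives in bifrost's config layer):

  ```go
  package main

  import "fmt"

  // allowed sketches the v2 deny-by-default semantics: an empty list
  // denies everything, and "*" is the explicit allow-all wildcard.
  func allowed(list []string, name string) bool {
  	for _, entry := range list {
  		if entry == "*" || entry == name {
  			return true
  		}
  	}
  	return false
  }

  func main() {
  	fmt.Println(allowed([]string{}, "gpt-4o"))     // false: empty now means deny all
  	fmt.Println(allowed([]string{"*"}, "gpt-4o"))  // true: wildcard allows all
  	fmt.Println(allowed([]string{"gpt-4o"}, "o3")) // false: not in the allow-list
  }
  ```

  Under `"version": 1`, the empty-list case would instead return `true`, restoring v1.4.x allow-all behavior.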

- **Schema: new providers**: Added `nebius`, `xai`, and `runway` as first-class provider entries.

- **Schema: provider key restructuring**: Replaced the inline key object definition in `virtual_key_provider_config` with a flat `key_ids` string array. Introduced dedicated key types `ollama_key`, `sgl_key`, and `replicate_key` with their own `_key_config` blocks. Removed `deployments` from `azure_key_config` and `vertex_key_config` (replaced by `aliases` on `base_key`). Added `aliases` to `base_key` for model-to-deployment/inference-profile mappings.
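
  An `aliases` mapping on a key might look like this (shape illustrative; consult the schema for the exact key-config structure):

  ```json
  {
    "id": "azure-key-1",
    "aliases": {
      "gpt-4o": "my-gpt4o-deployment"
    }
  }
  ```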

- **Schema: provider variants**: `ollama` and `sgl` now reference `provider_with_ollama_config` and `provider_with_sgl_config` respectively, which use `network_config_without_base_url` (URL is per-key). `replicate` references `provider_with_replicate_config`. Added `openai_config` def with `disable_store` for the Responses API. Renamed `concurrency_config` to `concurrency_and_buffer_size`.

- **Schema: network config**: Split `network_config` into `network_config` (with `base_url`) and `network_config_without_base_url`. Added `enforce_http2`, `stream_idle_timeout_in_seconds`, `max_conns_per_host`, `beta_header_overrides`, and `ca_cert_pem` fields. Renamed `retry_backoff_initial_ms`/`retry_backoff_max_ms` to `retry_backoff_initial`/`retry_backoff_max`.

- **Schema: MCP changes**: Removed `websocket` connection type; added `inprocess`. Added `per_user_oauth` auth type. Added `mcp_client_name` for config-file resolution. Added `allowed_extra_headers` and `allow_on_all_virtual_keys` to `mcp_client_config`. Added `disable_auto_tool_inject` to MCP plugin config. Added global `mcp_disable_auto_tool_inject` and `routing_chain_max_depth` to server config.

- **Schema: routing rules**: Added `chain_rule` boolean to `routing_rule`. Made `scope_id` required (non-null string) when `scope` is `team`, `customer`, or `virtual_key`.

- **Schema: budgets**: Moved `calendar_aligned` from the budget object to the virtual key level. Replaced `budget_id` on virtual key with `virtual_key_id`/`provider_config_id` on the budget object itself. Removed `budget_id` from `virtual_key_provider_config`.

- **Schema: logs store**: Added `object_storage` (S3/GCS) and `retention_days` to the logs store config.

- **Schema: pricing overrides**: Moved `pricing_overrides` from per-provider to a top-level array with scoped `provider_pricing_override` objects supporting `scope_kind`, `virtual_key_id`, `provider_id`, `provider_key_id`, `match_type`, `pattern`, `request_types`, and `pricing_patch`.
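
  A scoped override entry might look like this (values and the `pricing_patch` shape are illustrative, not taken from the schema):

  ```json
  {
    "pricing_overrides": [
      {
        "scope_kind": "virtual_key",
        "virtual_key_id": "vk-123",
        "match_type": "regex",
        "pattern": "^gpt-4.*",
        "request_types": ["chat_completion"],
        "pricing_patch": {}
      }
    ]
  }
  ```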

- **Schema: compat plugin**: Replaced `enable_litellm_fallbacks` with a structured `compat` object supporting `convert_text_to_chat`, `convert_chat_to_responses`, `should_drop_params`, and `should_convert_params`.

- **Schema: OTEL plugin**: Expanded the `trace_type` enum to include `genai_extension`, `vercel`, and `open_inference` (previously only `otel`).

- **Schema: SCIM**: Renamed `saml_config` to `scim_config`. Added `apiToken` to `okta_config` and made `clientSecret` and `apiToken` required. Changed cluster `dial_timeout` from integer (nanoseconds) to Go duration string.

- **Schema: misc**: Added `prompts` and `builtin` to plugin name/placement enums. Added `provider_configs` fields `id`, `description`, `network_config`, `proxy_config`, `custom_provider_config`, `concurrency_and_buffer_size`, and `openai_config`. Added `scim_config` top-level ref. Normalized multi-item `required` arrays to single-line format throughout.

## Type of change

- [ ] Bug fix
- [x] Feature
- [x] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [x] Providers/Integrations
- [x] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

```sh
# Validate schema against existing configs
npx ajv validate -s transports/config.schema.json -d your-config.json

# Verify release pipeline runs without test gate
# Push a tagged commit and confirm release jobs trigger directly after detect-changes
```

If upgrading from v1.4.x, set `"version": 1` in your config to preserve allow-all semantics for empty arrays, or migrate empty arrays to `["*"]` and adopt v2 deny-by-default semantics.
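
For example, pinning the old semantics takes a single top-level field (surrounding config omitted):

```json
{
  "version": 1
}
```

Alternatively, stay on v2 and replace each empty array with the wildcard, e.g. `"allowed_models": ["*"]` instead of `"allowed_models": []`.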

## Breaking changes

- [x] Yes
- [ ] No

**Empty arrays in `allowed_models`, `key_ids`, `tools_to_execute`, `provider_configs`, and `mcp_configs` now deny all access by default (v2 semantics).** To allow all, use `["*"]`. To restore v1.4.x behavior, set `"version": 1` at the top level of your config.

- `enable_litellm_fallbacks` has been removed; replace it with the `compat` object.
- `saml_config` has been renamed to `scim_config`.
- `budget_id` has been removed from virtual keys and `virtual_key_provider_config`.
- `calendar_aligned` has moved from the budget object to the virtual key.
- `deployments` has been removed from `azure_key_config` and `vertex_key_config`; use `aliases` on the key instead.
- `retry_backoff_initial_ms`/`retry_backoff_max_ms` have been renamed to `retry_backoff_initial`/`retry_backoff_max`.
- The `websocket` MCP connection type has been removed; use `http` or `sse`.
- Okta SCIM config now requires `clientSecret` and `apiToken`.

## Related issues

N/A

## Security considerations

The `insecure_skip_verify` and `ca_cert_pem` fields on `network_config` expose TLS bypass options; these should only be used in controlled environments. The `per_user_oauth` auth type for MCP introduces per-user credential flows that require careful OAuth config management. Removal of test gates from the release pipeline means regressions from flaky provider APIs will no longer block releases, but also means real failures could ship if not caught by other means.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* force version back to go 1.26.1 (#2817)

## Summary

Bumps `core` to v1.4.21 and updates `transports` to depend on `core` v1.4.20, while removing a now-unnecessary workspace Go directive workaround that was previously required to satisfy the toolchain constraint introduced by `core` v1.4.19.

## Changes

- Incremented `core` version from `1.4.20` to `1.4.21`
- Updated `transports/go.mod` to reference `core` v1.4.20 (previously v1.4.19)
- Removed the `go work edit -go=1.26.2 -toolchain=go1.26.2` workaround from the workspace setup script, which was only needed to satisfy the toolchain requirement imposed by the published `core` v1.4.19

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [ ] Providers/Integrations
- [ ] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

```sh
go work sync
go test ./...
```

Verify the workspace initializes without the explicit Go/toolchain directive and that all modules resolve correctly.

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

## Security considerations

None.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* revert everything to go1.26.1 (#2818)

## Summary

Bumps the core version to `1.4.22` and rolls back dependency versions across the framework, plugins, and transports to align with a prior stable set of releases. This resolves a version inconsistency introduced by forward-referencing newer module versions that were not yet intended to be consumed by downstream packages.

## Changes

- Incremented `core/version` from `1.4.21` to `1.4.22`
- Downgraded `bifrost/core` from `v1.4.19` → `v1.4.17` across `framework`, `governance`, `jsonparser`, `litellmcompat`, `logging`, `maxim`, `mocker`, `otel`, `semanticcache`, and `telemetry` plugins
- Downgraded `bifrost/framework` from `v1.2.38` → `v1.2.36` (or `v1.2.35` for `semanticcache`) across all dependent plugins
- Downgraded `bifrost/core` in `transports` from `v1.4.20` → `v1.4.19`
- Downgraded all plugin versions referenced in `transports` (governance, litellmcompat, logging, maxim, otel, semanticcache, telemetry) to their corresponding prior releases
- Downgraded `go.opentelemetry.io/otel/sdk` and `go.opentelemetry.io/otel/sdk/metric` from `v1.43.0` → `v1.40.0` in affected plugins
- Bumped Go toolchain version in `transports/go.mod` from `1.26.1` to `1.26.2`

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [x] Core (Go)
- [x] Transports (HTTP)
- [ ] Providers/Integrations
- [x] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

```sh
go test ./...
```

Verify that all modules resolve correctly with the pinned dependency versions and that no import errors occur during build.

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

N/A

## Security considerations

None. These are internal module version adjustments with no changes to auth, secrets, or data handling.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* bumped up hello-world dep (#2819)

## Summary

Pins the `bifrost/core` dependency in the example plugin modules to a consistent released version (`v1.4.17`), removing a local `replace` directive that was pointing to the local `core` module path.

## Changes

- Replaced the local `replace` directive in `hello-world-wasm-go/go.mod` with a direct reference to `github.com/maximhq/bifrost/core v1.4.17`
- Downgraded `hello-world/go.mod` from `v1.4.19` to `v1.4.17` to align both example plugins on the same released version

## Type of change

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Documentation
- [x] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [x] Plugins
- [ ] UI (Next.js)
- [ ] Docs

## How to test

```sh
cd examples/plugins/hello-world-wasm-go
go mod tidy
go build ./...

cd examples/plugins/hello-world
go mod tidy
go build ./...
```

## Breaking changes

- [ ] Yes
- [x] No

## Related issues

## Security considerations

No security implications. This change only affects dependency resolution for example plugin modules.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable

* framework: bump core to v1.4.22 --skip-pipeline

* plugins/governance: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/jsonparser: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/litellmcompat: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/logging: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/maxim: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/mocker: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/otel: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/semanticcache: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* plugins/telemetry: bump core to v1.4.22 and framework to v1.2.39 --skip-pipeline

* enforce go 1.26.1 (#2820)

* transports: update dependencies --skip-pipeline

* Adds changelog for v1.4.23 --skip-pipeline

* V1.5.0 (#2245)

* refactor: standardize empty array conventions for VK Provider & MCP Configs, and makes Provider Config weight optional for routing (#1932)

## Summary

Changes Virtual Key provider and MCP configurations from "allow-all by default" to "deny-by-default" security model. Virtual Keys now require explicit provider and MCP client configurations to allow access, improving security posture.

## Changes

- **Provider Configs**: Empty `provider_configs` now blocks all providers instead of allowing all
- **MCP Configs**: Empty `mcp_configs` now blocks all MCP tools instead of allowing all  
- **Weight Field**: Changed provider `weight` from required `float64` to optional `*float64` - null weight excludes provider from weighted routin…