
feat: core tests retry logic enhancements #921

Merged
akshaydeo merged 1 commit into main from 11-24-feat_core_tests_retry_logic_enhancements
Nov 24, 2025

Conversation

@Pratham-Mishra04
Collaborator

@Pratham-Mishra04 Pratham-Mishra04 commented Nov 24, 2025

Summary

  • fix: bedrock responses streaming last chunk indicator fixes
  • fix: gemini nil content check fixes
  • fix: handle responses.incomplete event in openai responses streaming
  • fix: stream accumulator nil content check fixes

Type of change

  • Bug fix
  • Feature
  • Refactor
  • Documentation
  • Chore/CI

Affected areas

  • Core (Go)
  • Transports (HTTP)
  • Providers/Integrations
  • Plugins
  • UI (Next.js)
  • Docs

How to test

Describe the steps to validate this change. Include commands and expected outcomes.

# Core/Transports
go version
go test ./...

# UI
cd ui
pnpm i || npm i
pnpm test || npm test
pnpm build || npm run build

If adding new configs or environment variables, document them here.

Screenshots/Recordings

If UI changes, add before/after screenshots or short clips.

Breaking changes

  • Yes
  • No

If yes, describe impact and migration instructions.

Related issues

Link related issues and discussions. Example: Closes #123

Security considerations

Note any security implications (auth, secrets, PII, sandboxing, etc.).

Checklist

  • I read docs/contributing/README.md and followed the guidelines
  • I added/updated tests where appropriate
  • I updated documentation where needed
  • I verified builds succeed (Go and UI)
  • I verified the CI pipeline passes locally if applicable

@github-actions
Contributor

🧪 Test Suite Available

This PR can be tested by a repository admin.

Run tests for PR #921

Collaborator Author

Pratham-Mishra04 commented Nov 24, 2025

@coderabbitai
Contributor

coderabbitai Bot commented Nov 24, 2025

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed potential crashes from nil content checks in streaming operations across multiple providers
    • Improved OpenAI responses streaming event handling for incomplete responses
    • Fixed Bedrock streaming last chunk indicator logic
  • Tests

    • Enhanced test retry framework for improved reliability testing across all scenarios
    • Added visual error indicators (❌) for better test result visibility
    • Improved streaming validation with more robust error handling
  • Chores

    • Version updates across core, framework, and plugin components


Walkthrough

Adds nil guards for message Content, refines streaming end/latency handling, and introduces an extensive test retry/validation framework. Many tests now use retry/validation wrappers and relax content-length constraints; various changelogs and minor config ordering fixes were applied.
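The nil-guard pattern the walkthrough describes can be sketched as follows. This is a minimal illustration: the Message type and field names are assumptions, not the actual bifrost schemas.

```go
package main

import "fmt"

// Message is a hypothetical stand-in for a streaming message whose
// Content pointer may be nil before the first delta arrives.
type Message struct {
	Content *string
}

// appendContent appends a delta to msg.Content, initializing it when nil --
// the kind of guard the walkthrough describes for the stream accumulator.
func appendContent(msg *Message, delta string) {
	if msg == nil {
		return
	}
	if msg.Content == nil {
		empty := ""
		msg.Content = &empty
	}
	*msg.Content += delta
}

func main() {
	var m Message // Content starts nil
	appendContent(&m, "hello ")
	appendContent(&m, "world")
	fmt.Println(*m.Content) // hello world
}
```

Without the nil check, the first `*msg.Content += delta` would panic on a nil pointer dereference, which is exactly the class of crash the PR guards against.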

Changes

  • Core providers (core/providers/gemini/utils.go, core/providers/bedrock/bedrock.go, core/providers/openai/openai.go): Add nil-guards for message content (Gemini); set latency only on the last finalResponse (Bedrock); treat responses.incomplete as stream end, like Completed (OpenAI).
  • Framework, streaming (framework/streaming/accumulator.go, framework/streaming/responses.go): Add nil checks and initialize Content when appending; remove finalization-based tool-call population; avoid nil dereferences in accumulator logic.
  • Transports / config (transports/bifrost-http/lib/config.go): Change provider add order to persist first, then update in-memory; map store ErrNotFound to public ErrNotFound.
  • Plugins, mocker (plugins/mocker/main.go): Add nil-guards for message.Content in fast extraction paths and small formatting tweaks.
  • Tests, retry & validation framework (tests/core-providers/scenarios/test_retry_framework.go, tests/core-providers/scenarios/* including embedding.go, list_models.go, responses_stream.go, speech_synthesis*.go, transcription*.go, text_completion*.go, tool_calls_streaming.go, etc.): Introduce a unified retry/validation framework with many WithTestRetry / WithStreamValidationRetry wrappers, new retry config types, validators, and streaming validation result types; replace many direct calls with retry wrappers and relax or remove MinContentLength/MaxContentLength constraints.
  • Tests, streaming & tool-calls (tests/core-providers/scenarios/tool_calls_streaming.go, tests/core-providers/scenarios/responses_stream.go, tests/core-providers/scenarios/*_stream.go): Centralize streaming retry/validation, add accumulators for tool-call aggregation and per-chunk validation, and switch streaming flows to validation-driven retry wrappers.
  • Tests, expectations & validation helpers (tests/core-providers/scenarios/response_validation.go, tests/core-providers/scenarios/validation_presets.go): Remove Min/MaxContentLength from ResponseExpectations; add tool-argument validation fields; add ValidateListModelsResponse and list-models helpers; add ShouldHaveLatency and convert some failures to warnings.
  • Tests, misc updates (tests/core-providers/scenarios/utils.go, tests/core-providers/tests.go, tests/core-providers/scenarios/chat_completion_stream.go): Add GetErrorMessage and TTS helper retry wiring; change test-summary labels from emoji to bracketed statuses; replace direct stream invocations with retry wrappers; improve error messages and formatting.
  • Framework / docs (core/changelog.md, framework/changelog.md, transports/changelog.md, various plugins/*/changelog.md and version files): Changelog updates and many small version bumps across modules; add entries for streaming and nil-content fixes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Test as Test Scenario
  participant Retry as Retry Wrapper
  participant Client as Provider Client
  participant Validator as Validator
  participant Accum as Accumulator

  Note over Test,Retry: New common flow for unary & streaming operations
  Test->>Retry: Invoke operation with RetryConfig + Context
  Retry->>Client: Attempt request / open stream
  Client-->>Retry: response or stream channel / error
  alt streaming
    Client->>Validator: stream chunks
    Validator->>Accum: accumulate content / tool-calls
    Validator-->>Retry: StreamingValidationResult (per attempt)
    Retry-->>Test: aggregated StreamingValidationResult after retries
  else unary
    Retry-->>Test: response or final error after retries
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas needing extra attention:

  • The new global retry framework and its many WithTestRetry / WithStreamValidationRetry wrappers and public types.
  • Changes to test expectations/validation (removal of length constraints, new list-models validation).
  • Streaming validation + accumulator interactions and tool-call aggregation logic.
  • Core provider streaming changes (Gemini nil-guards, Bedrock latency behavior, OpenAI incomplete handling).

Suggested reviewers

  • danpiths

Poem

"I hopped through code with nimble paws,
Guards in place to stop the flaws.
Streams now end and tests retry,
No nils to make the build go cry.
— a rabbit, with a tiny laptop 🐰"

Pre-merge checks and finishing touches

❌ Failed checks (3 warnings)
  • Description check (⚠️ Warning): The PR description is completely filled with template placeholders and contains no substantive information about the changes, objectives, or implementation details. Resolution: Replace template placeholders with actual descriptions of the retry logic enhancements, type of change, affected areas, related issues, and testing steps.
  • Linked Issues check (⚠️ Warning): The linked issue #123 is about Files API Support (OpenAI/Anthropic file uploads), which is unrelated to the PR's core tests retry logic enhancements described in the changeset. Resolution: Link to issues that describe retry logic enhancements for tests, or remove the unrelated Files API issue and document the actual objectives being addressed.
  • Out of Scope Changes check (⚠️ Warning): Extensive changes were made across test files (retry frameworks, validation logic, streaming adjustments) plus core providers, frameworks, and transports, far beyond the scope of a simple 'tests retry logic enhancements' feature. Resolution: Clarify the scope and document all affected subsystems. If changes span core providers and transports, update the PR title and description to reflect the broader scope beyond just test enhancements.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title 'feat: core tests retry logic enhancements' clearly describes the main change: adding retry logic enhancements to core tests.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78cb6be and dd8cd0c.

📒 Files selected for processing (53)
  • core/changelog.md (1 hunks)
  • core/providers/bedrock/bedrock.go (2 hunks)
  • core/providers/gemini/utils.go (2 hunks)
  • core/providers/openai/openai.go (1 hunks)
  • core/version (1 hunks)
  • framework/changelog.md (1 hunks)
  • framework/streaming/accumulator.go (3 hunks)
  • framework/streaming/responses.go (0 hunks)
  • framework/version (1 hunks)
  • plugins/governance/changelog.md (1 hunks)
  • plugins/governance/version (1 hunks)
  • plugins/jsonparser/changelog.md (1 hunks)
  • plugins/jsonparser/version (1 hunks)
  • plugins/logging/changelog.md (1 hunks)
  • plugins/logging/version (1 hunks)
  • plugins/maxim/changelog.md (1 hunks)
  • plugins/maxim/version (1 hunks)
  • plugins/mocker/changelog.md (1 hunks)
  • plugins/mocker/main.go (3 hunks)
  • plugins/mocker/version (1 hunks)
  • plugins/otel/changelog.md (1 hunks)
  • plugins/otel/version (1 hunks)
  • plugins/semanticcache/changelog.md (1 hunks)
  • plugins/semanticcache/version (1 hunks)
  • plugins/telemetry/changelog.md (1 hunks)
  • plugins/telemetry/version (1 hunks)
  • tests/core-providers/scenarios/chat_completion_stream.go (5 hunks)
  • tests/core-providers/scenarios/complete_end_to_end.go (0 hunks)
  • tests/core-providers/scenarios/embedding.go (1 hunks)
  • tests/core-providers/scenarios/end_to_end_tool_calling.go (0 hunks)
  • tests/core-providers/scenarios/error_parser.go (2 hunks)
  • tests/core-providers/scenarios/image_base64.go (0 hunks)
  • tests/core-providers/scenarios/image_url.go (0 hunks)
  • tests/core-providers/scenarios/list_models.go (3 hunks)
  • tests/core-providers/scenarios/multi_turn_conversation.go (0 hunks)
  • tests/core-providers/scenarios/multiple_images.go (0 hunks)
  • tests/core-providers/scenarios/reasoning.go (0 hunks)
  • tests/core-providers/scenarios/response_validation.go (10 hunks)
  • tests/core-providers/scenarios/responses_stream.go (9 hunks)
  • tests/core-providers/scenarios/speech_synthesis.go (3 hunks)
  • tests/core-providers/scenarios/speech_synthesis_stream.go (2 hunks)
  • tests/core-providers/scenarios/test_retry_framework.go (28 hunks)
  • tests/core-providers/scenarios/text_completion.go (1 hunks)
  • tests/core-providers/scenarios/text_completion_stream.go (3 hunks)
  • tests/core-providers/scenarios/tool_calls_streaming.go (4 hunks)
  • tests/core-providers/scenarios/transcription.go (7 hunks)
  • tests/core-providers/scenarios/transcription_stream.go (1 hunks)
  • tests/core-providers/scenarios/utils.go (2 hunks)
  • tests/core-providers/scenarios/validation_presets.go (2 hunks)
  • tests/core-providers/tests.go (1 hunks)
  • transports/bifrost-http/lib/config.go (13 hunks)
  • transports/changelog.md (1 hunks)
  • transports/version (1 hunks)

Comment @coderabbitai help to get the list of available commands and usage tips.

@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 11-24-feat_core_tests_retry_logic_enhancements branch from ebe3476 to 8661443 Compare November 24, 2025 07:32
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tests/core-providers/scenarios/utils.go (1)

588-612: Fix potential nil-pointer dereference in GetErrorMessage

GetErrorMessage assumes err.Error is non-nil; any *schemas.BifrostError without an Error field will panic when you access err.Error.Type, err.Error.Code, or err.Error.Message. This can crash tests any time a provider returns a structurally different error.

Recommend guarding err.Error and falling back to the core helper:

 func GetErrorMessage(err *schemas.BifrostError) string {
 	if err == nil {
 		return ""
 	}
 
-	errorType := ""
-	if err.Type != nil && *err.Type != "" {
-		errorType = *err.Type
-	}
-
-	if errorType == "" && err.Error.Type != nil && *err.Error.Type != "" {
-		errorType = *err.Error.Type
-	}
-
-	errorCode := ""
-	if err.Error.Code != nil && *err.Error.Code != "" {
-		errorCode = *err.Error.Code
-	}
-
-	errorMessage := err.Error.Message
-
-	errorString := fmt.Sprintf("%s %s: %s", errorType, errorCode, errorMessage)
-
-	return errorString
+	// If structured error details are missing, defer to core helper.
+	if err.Error == nil {
+		return bifrost.GetErrorMessage(err)
+	}
+
+	errorType := ""
+	if err.Type != nil && *err.Type != "" {
+		errorType = *err.Type
+	}
+	if errorType == "" && err.Error.Type != nil && *err.Error.Type != "" {
+		errorType = *err.Error.Type
+	}
+
+	errorCode := ""
+	if err.Error.Code != nil && *err.Error.Code != "" {
+		errorCode = *err.Error.Code
+	}
+
+	errorMessage := err.Error.Message
+	if errorMessage == "" {
+		return bifrost.GetErrorMessage(err)
+	}
+
+	return fmt.Sprintf("%s %s: %s", errorType, errorCode, errorMessage)
 }

This prevents panics and keeps error formatting consistent with core behavior when detailed fields are absent.

tests/core-providers/scenarios/speech_synthesis.go (1)

254-275: Fix loop variable capture in AllVoiceOptions voice-type subtests

In AllVoiceOptions, the loop:

for _, voiceType := range voiceTypes {
	t.Run("VoiceType_"+voiceType, func(t *testing.T) {
		if os.Getenv("SKIP_PARALLEL_TESTS") != "true" {
			t.Parallel()
		}
		voice := GetProviderVoice(testConfig.Provider, voiceType)
        ...
		voiceRetryContext := TestRetryContext{
			ScenarioName: "SpeechSynthesis_VoiceType_" + voiceType,
            ...
			TestMetadata: map[string]interface{}{
				"voice_type": voiceType,
				"voice":      voice,
			},
		}

captures the range variable voiceType inside a subtest that runs with t.Parallel. Depending on Go version/scheduling, all subtests can end up using the last voiceType, so you may not actually be exercising all three voice types.

Capture voiceType per iteration before calling t.Run:

-	for _, voiceType := range voiceTypes {
-		t.Run("VoiceType_"+voiceType, func(t *testing.T) {
+	for _, voiceType := range voiceTypes {
+		voiceType := voiceType // capture per-iteration value
+		t.Run("VoiceType_"+voiceType, func(t *testing.T) {
 			if os.Getenv("SKIP_PARALLEL_TESTS") != "true" {
 				t.Parallel()
 			}
 
 			voice := GetProviderVoice(testConfig.Provider, voiceType)
            ...

This ensures each subtest actually runs with its own intended voiceType.

tests/core-providers/scenarios/tool_calls_streaming.go (1)

94-179: Prevent spurious "default" key entries in ResponsesToolCalls map

The issue is confirmed. When AccumulateResponsesToolCall is called with all four parameters (itemID, callID, name, arguments) nil or empty—which is theoretically possible during malformed streaming chunks—the code creates an entry under the "default" key. This entry is never cleaned up because the key migration logic at line 146 explicitly skips deletion when key == "default". As a result, GetFinalResponsesToolCalls() can return an entry with empty ID, Name, and Arguments.

The suggested fix is sound: initialize key := "" instead of "default", and return early if no identifying information is present.

tests/core-providers/scenarios/responses_stream.go (1)

774-813: Align sequence-number validation: remove fatal failure or consolidate duplicate checks

The function at lines 778–781 uses t.Fatalf("⚠️ Warning:...") which contradicts the warning emoji—a fatal failure should use a clear failure marker. Additionally, validateResponsesStreamingResponse (lines 869–873) performs the identical sequence-number check but appends to an errors slice instead of failing fatally.

Change line 779 to use t.Logf with a failure emoji (❌) for consistency with other non-fatal checks (lines 793, 796, 799, 806), or consolidate both checks by removing the fatal one and relying on the soft error accumulation in validateResponsesStreamingResponse.

tests/core-providers/scenarios/embedding.go (1)

123-144: Fix invalid range over int and handle 2D embedding responses safely

In validateEmbeddingSemantics (tests/core-providers/scenarios/embedding.go, lines 123-144):

  • for i := range responseDataLength (range over an int) only compiles on Go 1.22 or newer; on older toolchains it is a compile error, since range previously required an iterable type (slice, array, map, channel, string). If the module targets an earlier Go version, this will fail to compile.
  • The logic creates an out-of-bounds risk: when responseDataLength is reassigned from len(Embedding2DArray) (line 129), subsequent indexing into response.Data[i] can panic if len(response.Data) < responseDataLength (e.g., single 2D embedding entry containing multiple vectors).

Replace the loop with explicit handling for both 1D and 2D embedding cases:

-	responseDataLength := len(response.Data)
-	if responseDataLength != len(testTexts) {
-		if responseDataLength > 0 && response.Data[0].Embedding.Embedding2DArray != nil {
-			responseDataLength = len(response.Data[0].Embedding.Embedding2DArray)
-		}
-		if responseDataLength != len(testTexts) {
-			t.Fatalf("Expected %d embedding results, got %d", len(testTexts), responseDataLength)
-		}
-	}
-
-	for i := range responseDataLength {
-		vec, extractErr := getEmbeddingVector(response.Data[i])
+	dataLen := len(response.Data)
+	if dataLen == len(testTexts) {
+		// Standard 1D embeddings: one entry per input
+		for i := 0; i < dataLen; i++ {
+			vec, extractErr := getEmbeddingVector(response.Data[i])
+			if extractErr != nil {
+				t.Fatalf("Failed to extract embedding vector for text '%s': %v", testTexts[i], extractErr)
+			}
+			if len(vec) == 0 {
+				t.Fatalf("Embedding vector is empty for text '%s'", testTexts[i])
+			}
+			embeddings[i] = vec
+		}
+	} else if dataLen > 0 && response.Data[0].Embedding.Embedding2DArray != nil {
+		// 2D embeddings: all vectors contained in a single entry
+		vectors := response.Data[0].Embedding.Embedding2DArray
+		if len(vectors) != len(testTexts) {
+			t.Fatalf("Expected %d embedding results, got %d", len(testTexts), len(vectors))
+		}
+		for i := 0; i < len(vectors); i++ {
+			vec := vectors[i]
+			if len(vec) == 0 {
+				t.Fatalf("Embedding vector is empty for text '%s'", testTexts[i])
+			}
+			embeddings[i] = vec
+		}
+	} else {
+		t.Fatalf("Expected %d embedding results, got %d", len(testTexts), dataLen)
 	}
🧹 Nitpick comments (15)
tests/core-providers/scenarios/utils.go (1)

536-563: Reuse default speech retry conditions to avoid weakening retries

You construct speechRetryConfig from DefaultSpeechRetryConfig() but drop its Conditions by setting an empty slice, so only validation failures and timeouts will trigger retries. If DefaultSpeechRetryConfig includes useful conditions (e.g., empty audio, generic errors), consider propagating them:

-	speechRetryConfig := SpeechRetryConfig{
-		MaxAttempts: retryConfig.MaxAttempts,
-		BaseDelay:   retryConfig.BaseDelay,
-		MaxDelay:    retryConfig.MaxDelay,
-		Conditions:  []SpeechRetryCondition{},
-		OnRetry:     retryConfig.OnRetry,
-		OnFinalFail: retryConfig.OnFinalFail,
-	}
+	speechRetryConfig := SpeechRetryConfig{
+		MaxAttempts: retryConfig.MaxAttempts,
+		BaseDelay:   retryConfig.BaseDelay,
+		MaxDelay:    retryConfig.MaxDelay,
+		Conditions:  retryConfig.Conditions, // preserve default speech retry conditions
+		OnRetry:     retryConfig.OnRetry,
+		OnFinalFail: retryConfig.OnFinalFail,
+	}

This keeps the helper aligned with the main speech retry behavior.

Also applies to: 565-569

tests/core-providers/scenarios/speech_synthesis_stream.go (1)

375-379: Capture per-iteration voice value inside streaming voice tests

In MultipleVoices_Streaming, the loop:

for _, voice := range voices {
    t.Run("StreamingVoice_"+voice, func(t *testing.T) {
        ...
        validationResult := WithSpeechStreamValidationRetry(... func(responseChannel chan *schemas.BifrostStream) SpeechStreamValidationResult {
            ...
            streamErrors = append(streamErrors, fmt.Sprintf("Received nil stream response for voice %s", voice))
            ...
        })

relies on the range variable voice inside nested closures. With t.Run + t.Parallel, this can lead to all subtests logging the same voice for errors, depending on Go version and scheduling.

You already introduce voiceCopy := voice for the request; reusing that in logs avoids surprises:

-	for _, voice := range voices {
-		t.Run("StreamingVoice_"+voice, func(t *testing.T) {
+	for _, voice := range voices {
+		voice := voice // capture per-iteration value
+		t.Run("StreamingVoice_"+voice, func(t *testing.T) {
 			if os.Getenv("SKIP_PARALLEL_TESTS") != "true" {
 				t.Parallel()
 			}
 
 			voiceCopy := voice
             ...
-						if response == nil {
-							streamErrors = append(streamErrors, fmt.Sprintf("Received nil stream response for voice %s", voice))
+						if response == nil {
+							streamErrors = append(streamErrors, fmt.Sprintf("Received nil stream response for voice %s", voiceCopy))
 							continue
 						}
             ...
-								if response.BifrostSpeechStreamResponse != nil && response.BifrostSpeechStreamResponse.Audio != nil && len(response.BifrostSpeechStreamResponse.Audio) > 0 {
-									receivedData = true
-									t.Logf("✅ Received data for voice %s: %d bytes", voice, len(response.BifrostSpeechStreamResponse.Audio))
+								if response.BifrostSpeechStreamResponse != nil && response.BifrostSpeechStreamResponse.Audio != nil && len(response.BifrostSpeechStreamResponse.Audio) > 0 {
+									receivedData = true
+									t.Logf("✅ Received data for voice %s: %d bytes", voiceCopy, len(response.BifrostSpeechStreamResponse.Audio))
 								}
 			...
-							if !receivedData {
-								validationErrors = append(validationErrors, fmt.Sprintf("Should receive audio data for voice %s", voice))
+							if !receivedData {
+								validationErrors = append(validationErrors, fmt.Sprintf("Should receive audio data for voice %s", voiceCopy))
 							}

This keeps logs and validation messages aligned with the actual voice used in the request.

Also applies to: 412-469

tests/core-providers/scenarios/speech_synthesis.go (1)

85-113: Speech retry integration across scenarios looks solid

Using GetTestRetryConfigForScenario → SpeechRetryConfig → WithSpeechTestRetry in the basic, HD, and voice-type tests is consistent with the shared retry framework. Pairing this with SpeechExpectations + ModifyExpectationsForProvider and then doing speech-specific checks (validateSpeechSynthesisSpecific, size thresholds, modelRequested logging) gives good coverage of both transport and content.

One nuance: each SpeechRetryConfig you construct sets Conditions: []SpeechRetryCondition{}. If your default speech retry config defines useful conditions (e.g., empty audio, generic error patterns), consider copying them from the scenario-level retryConfig/voiceRetryConfig instead of zeroing them out, so behavior stays aligned with other speech scenarios.

Also applies to: 100-112, 116-122, 192-207, 210-218, 222-227, 276-301, 279-301, 304-307, 308-314

tests/core-providers/scenarios/text_completion.go (1)

55-56: TextCompletion retry wiring + relaxed content checks are appropriate

Switching to TextCompletionRetryConfig + WithTextCompletionTestRetry and using GetTextCompletionContent aligns text completion with the shared retry framework. Removing strict keyword assertions is the right call for non-deterministic LLMs and keeps the test focused on functional validity via GetExpectationsForScenario/ModifyExpectationsForProvider.

As with the speech configs, note that Conditions is set to an empty slice; if you later define text-completion-specific retry conditions, you can pass them here without changing the call sites.

Also applies to: 57-65, 67-69, 75-76

tests/core-providers/scenarios/tool_calls_streaming.go (1)

218-342: Add a timeout / context-based select around the chat streaming loop to avoid hangs

The new Chat Completions tool-call streaming test reads from responseChannel with a bare for response := range responseChannel and no timeout or select. If the provider leaves the stream open or never sends any chunks, the test can hang indefinitely despite WithStreamRetry succeeding in establishing the channel.

To align with the more robust pattern you already use in other streaming tests (e.g., ResponsesStream with context.WithTimeout and a select on <-streamCtx.Done()), consider wrapping the chat streaming loop similarly:

-		for response := range responseChannel {
+		streamCtx, cancel := context.WithTimeout(ctx, 200*time.Second)
+		defer cancel()
+
+		for {
+			select {
+			case response, ok := <-responseChannel:
+				if !ok {
+					goto chatStreamComplete
+				}
+				// existing processing…
+			case <-streamCtx.Done():
+				t.Fatal("Timeout waiting for chat completions streaming response with tools")
+			}
+		}
+
+	chatStreamComplete:

This keeps the test from hanging while preserving the existing response processing and responseCount guard.

tests/core-providers/scenarios/validation_presets.go (1)

15-29: Expectation presets look consistent; be aware of boolean override semantics in CombineExpectations

The new defaults (e.g., ExpectedChoiceCount: 1, ShouldHaveLatency: true, and ShouldHaveContent: false for tool-only scenarios) line up with how the tests are written and the new streaming validations.

One nuance: CombineExpectations only overrides boolean fields when the later expectation’s value is true, so you can’t currently use it to force a flag back to false (e.g., turn ShouldHaveContent or ShouldHaveLatency off). That’s fine if these are intended as global “must be true” invariants, but worth keeping in mind if you later try to combine presets that intentionally relax them.

Also applies to: 62-81, 147-163, 166-187, 387-446
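The "later true wins" override semantics described above can be illustrated with a single flag. Expectations and combine are hypothetical stand-ins for the PR's actual types, assuming only that boolean fields are overridden when the later value is true:

```go
package main

import "fmt"

// Expectations carries a single boolean flag, standing in for a
// larger expectations struct (names are assumptions).
type Expectations struct {
	ShouldHaveContent bool
}

// combine overrides a boolean field only when the later value is true,
// so a later expectation can turn a flag on but never back off.
func combine(base, override Expectations) Expectations {
	out := base
	if override.ShouldHaveContent {
		out.ShouldHaveContent = true
	}
	return out
}

func main() {
	on := Expectations{ShouldHaveContent: true}
	off := Expectations{ShouldHaveContent: false}
	fmt.Println(combine(on, off).ShouldHaveContent) // true: cannot relax back to false
	fmt.Println(combine(off, on).ShouldHaveContent) // true: can always turn on
}
```

A zero-value false in the override is indistinguishable from "leave unchanged", which is why combining presets cannot force a flag back to false; making the field a *bool would distinguish "unset" from "explicitly false".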

tests/core-providers/scenarios/transcription.go (1)

80-112: Retry wiring for TTS and transcription looks solid; consider handling TTS generation errors explicitly

The new use of GetTestRetryConfigForScenario + {Speech,Transcription}RetryConfig + With{Speech,Transcription}TestRetry cleanly standardizes retry behavior across all transcription-related tests and keeps the validation logic localized.

One small robustness gap: calls to GenerateTTSAudioForTest (e.g., in AdditionalAudioTests, AllResponseFormats, WithCustomParameters, MultipleLanguages) ignore its error return and assume audioData is valid. If that helper can fail (e.g., provider outage), it would be clearer to:

  • Check the error and fail fast with a descriptive message, or
  • Surface it through the same retry framework if appropriate.

Otherwise, the new retry integration and enriched error messages look good.

Also applies to: 142-177, 231-269, 313-352, 383-421, 453-492

tests/core-providers/scenarios/responses_stream.go (1)

65-241: Validation callback is thorough; consider using expectations/latency more explicitly

The new WithResponsesStreamValidationRetry callback does a nice job tracking event types, sequence numbers, and assembled content, and it cleanly returns a ResponsesStreamValidationResult for the retry wrapper to act on.

A couple of small polish opportunities:

  • You build expectations := GetExpectationsForScenario("ResponsesStream", ...) and tweak ShouldContainKeywords, but the resulting expectations aren’t currently used in any validation. Either wire them into validateResponsesStreamingResponse (e.g., to assert on keywords) or drop this block to avoid confusion.
  • LastLatency in the returned ResponsesStreamValidationResult is hard-coded to 0 even though lastResponse is tracked and its latency is logged. If callers ever want to assert on end-to-end latency, consider populating LastLatency from lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency.

These are non-blocking, but would make the new validation result more self-contained and intention-revealing.

tests/core-providers/scenarios/response_validation.go (1)

276-311: List-models validation is solid; consider aligning latency semantics with other validators

The new ValidateListModelsResponse, validateListModelsFields, and collectListModelsResponseMetrics cover key integrity points: non-empty Data, non-empty IDs, provider and request type checks, optional expected_provider and min_model_count, and basic latency and pagination metrics. This should work well with the new list-models retry flow.

One minor consistency point: here you treat ExtraFields.Latency < 0 as invalid and accept 0 as OK, whereas other validators treat <= 0 as an error when ShouldHaveLatency is true. If a zero latency is effectively “missing/invalid” for tests (as it seems in other paths), aligning the check to <= 0 would make behavior consistent across response types.

Example tweak:

-	if expectations.ShouldHaveLatency {
-		if response.ExtraFields.Latency < 0 {
-			result.Passed = false
-			result.Errors = append(result.Errors, fmt.Sprintf("Invalid latency: %d ms (should be non-negative)", response.ExtraFields.Latency))
-		} else {
-			result.MetricsCollected["latency_ms"] = response.ExtraFields.Latency
-		}
-	}
+	if expectations.ShouldHaveLatency {
+		if response.ExtraFields.Latency <= 0 {
+			result.Passed = false
+			result.Errors = append(result.Errors, fmt.Sprintf("Expected latency information but not present or invalid (got %d ms)", response.ExtraFields.Latency))
+		} else {
+			result.MetricsCollected["latency_ms"] = response.ExtraFields.Latency
+		}
+	}

If 0 is intentionally allowed for list-models, then current code is fine; just be aware it diverges from other validators.

Also applies to: 1030-1102

tests/core-providers/scenarios/test_retry_framework.go (6)

163-167: ListModels-specific types are good, but default config API is a bit inconsistent

Introducing ListModelsRetryCondition and ListModelsRetryConfig mirrors the other typed configs and makes list-models tests easier to reason about.

Two small inconsistencies worth addressing:

  1. DefaultListModelsRetryConfig returns TestRetryConfig, not ListModelsRetryConfig, despite the name/comment. That’s fine for GetTestRetryConfigForScenario, but confusing for callers who might expect a ListModelsRetryConfig suitable for WithListModelsTestRetry.

  2. ListModelsRetryConfig.Conditions and checkListModelsRetryConditions are currently unused in WithListModelsTestRetry, so providing conditions has no effect.

If you want to keep the generic and specific paths, consider:

  • Either renaming DefaultListModelsRetryConfig to clarify it returns TestRetryConfig for scenario-level wiring, or
  • Adding a parallel helper such as DefaultListModelsOperationRetryConfig() ListModelsRetryConfig and/or actually invoking checkListModelsRetryConditions inside WithListModelsTestRetry when building retryReason.

This is API polish rather than a correctness bug, but tightening it now will avoid confusion later.

Also applies to: 247-255, 940-957, 1144-1145


258-273: Standardizing default retry configs to 10 attempts with 2s base/10s max backoff

You’ve normalized most of the default TestRetryConfig creators (general, tool-calls, multi-tool, image, streaming, conversation, speech, transcription, reasoning, embedding) to MaxAttempts: 10, BaseDelay: 2s, MaxDelay: 10s. That’s consistent with the framework’s goal of being resilient to flaky LLM behavior.

The only trade-off is potential test runtime increase in worst-case failure scenarios. If that ever becomes an issue in CI, these values might be good candidates to drive from shared config (e.g., config.ComprehensiveTestConfig) rather than hard-coded literals.

Also applies to: 770-783, 787-799, 803-818, 820-841, 845-857, 861-873, 875-894, 897-910, 913-922, 925-938


722-764: Timeout detection is robust but “timeout” substring may be a bit broad

timeoutPhrases plus isTimeoutError provide a pragmatic way to treat gateway/context timeouts as special, checking both the raw Error.Message and the higher-level GetErrorMessage(err).

One thing to be aware of is that the bare "timeout" phrase could, in theory, produce false positives on unrelated messages that happen to include that word. If you ever see misclassified errors in logs, consider tightening that particular phrase (e.g., to " timeout" or more specific variants) while keeping the others as-is.
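One way to tighten the bare phrase without touching the specific ones is a word-boundary match. A sketch (the phrase list here is illustrative; the real timeoutPhrases lives in test_retry_framework.go and may differ):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Illustrative only: the framework's actual phrase list may differ.
var timeoutPhrases = []string{
	"context deadline exceeded",
	"deadline exceeded",
	"request timed out",
}

// \btimeout\b matches "timeout" as a whole word, so "gateway timeout" hits
// but identifiers like "timeout_config" and words like "timeouts" do not
// (underscore is a word character, so no boundary follows "timeout").
var bareTimeout = regexp.MustCompile(`\btimeout\b`)

func isTimeoutMessage(msg string) bool {
	lower := strings.ToLower(msg)
	for _, p := range timeoutPhrases {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return bareTimeout.MatchString(lower)
}

func main() {
	fmt.Println(isTimeoutMessage("504 Gateway Timeout"))           // true
	fmt.Println(isTimeoutMessage("invalid field: timeout_config")) // false
}
```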


968-1100: Dual API retry: behavior is good; consider adding ❌ prefix to timeout retry reasons

WithDualAPITestRetry correctly:

  • Treats timeouts from either API as a primary retry trigger with exponential backoff.
  • Retries when either Chat or Responses validation fails, and logs per-API validation details with prefixes.
  • On final failure, wraps accumulated errors with a prefixed ❌ dual API test failed … message.

One minor consistency issue: the timeout-only retryReason strings ("Chat API timeout error: ...", "Responses API timeout error: ...") don't include the explicit ❌ prefix, unlike the other retry paths. To keep logs uniform you could do:

- retryReason = fmt.Sprintf("Chat API timeout error: %s", GetErrorMessage(chatErr))
+ retryReason = fmt.Sprintf("❌ Chat API timeout error: %s", GetErrorMessage(chatErr))
...
- retryReason = fmt.Sprintf("Responses API timeout error: %s", GetErrorMessage(responsesErr))
+ retryReason = fmt.Sprintf("❌ Responses API timeout error: %s", GetErrorMessage(responsesErr))

This is purely cosmetic but aligns with the rest of the framework’s error messaging.


1174-1831: Per-operation retry wrappers (TextCompletion/Speech/Embedding/Transcription) are consistent but heavily duplicated

The new WithTextCompletionTestRetry, WithSpeechTestRetry, WithEmbeddingTestRetry, and WithTranscriptionTestRetry:

  • Follow the same pattern as the Chat/Responses wrappers (validate-first, always retry on validation errors, special handling for timeouts).
  • Ensure OnFinalFail messages include a ❌ prefix and surface the last validation errors clearly.
  • Provide operation-specific condition hooks via checkTextCompletionRetryConditions, checkSpeechRetryConditions, checkEmbeddingRetryConditions, and checkTranscriptionRetryConditions.

Functionally this is solid and matches the documented strategy. The only downside is the amount of near-identical retry-loop code, which may be harder to maintain long-term if you tweak behavior (e.g., adding a new retry reason) and need to change it in several places.

If this framework keeps growing, consider extracting a generic “validate + retry” helper that takes typed adapters (validate func, condition checker, final error builder) to reduce duplication, but this is not urgent.


1843-1965: ListModels retry wrapper matches “always retry on any failure” but doesn’t use its Conditions or ❌ prefixes consistently

WithListModelsTestRetry:

  • Retries on any error, nil response, validation failure, or empty Data, which matches the commentary about being maximally defensive for list-models tests.
  • Uses ValidateListModelsResponse to centralize structural checks, and avoids exiting early until either validation passes with no error or attempts are exhausted.

Two follow-ups to consider:

  1. Conditions are ignored
    Even though ListModelsRetryConfig has a Conditions []ListModelsRetryCondition field and you’ve added checkListModelsRetryConditions, the wrapper never calls it. If you intend these conditions to be usable, you probably want to fold them into the shouldRetry / retryReason computation (even if validation failures remain the primary trigger).

  2. Logging consistency
    Some messages in this path lack the ❌ prefix that you’ve standardized elsewhere:

    • Timeout branch: retryReason = fmt.Sprintf("timeout error detected: %s", ...)
    • Final OnFinalFail: fmt.Errorf("final error: %s", errorMsg) with no prefix.

    To keep logs coherent, you could add the ❌ prefix in these few places, similar to the other wrappers.

Neither affects correctness, but addressing both would make the list-models path behave more like the rest of the framework and make the new condition type actually useful.
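If you do decide to wire the conditions in, the fold could be as small as the following sketch; the condition signature here is an assumption (the real ListModelsRetryCondition may take richer arguments such as the response and context):

```go
package main

import "fmt"

// Assumed shape: the real ListModelsRetryCondition in
// test_retry_framework.go may differ.
type listModelsRetryCondition func(modelCount int, err error) (bool, string)

// checkConditions folds user-supplied conditions into the shouldRetry /
// retryReason computation; validation failures would remain the primary
// trigger, with this running only as a secondary check.
func checkConditions(conds []listModelsRetryCondition, modelCount int, err error) (bool, string) {
	for _, c := range conds {
		if retry, reason := c(modelCount, err); retry {
			return true, fmt.Sprintf("❌ retry condition triggered: %s", reason)
		}
	}
	return false, ""
}

func main() {
	tooFew := func(n int, _ error) (bool, string) {
		if n < 1 {
			return true, "empty model list"
		}
		return false, ""
	}
	retry, reason := checkConditions([]listModelsRetryCondition{tooFew}, 0, nil)
	fmt.Println(retry, reason) // true ❌ retry condition triggered: empty model list
}
```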

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d55601 and 8661443.

📒 Files selected for processing (28)
  • core/providers/gemini/utils.go (2 hunks)
  • framework/streaming/accumulator.go (3 hunks)
  • plugins/mocker/main.go (3 hunks)
  • tests/core-providers/scenarios/chat_completion_stream.go (2 hunks)
  • tests/core-providers/scenarios/complete_end_to_end.go (0 hunks)
  • tests/core-providers/scenarios/embedding.go (1 hunks)
  • tests/core-providers/scenarios/end_to_end_tool_calling.go (0 hunks)
  • tests/core-providers/scenarios/error_parser.go (2 hunks)
  • tests/core-providers/scenarios/image_base64.go (0 hunks)
  • tests/core-providers/scenarios/image_url.go (0 hunks)
  • tests/core-providers/scenarios/list_models.go (3 hunks)
  • tests/core-providers/scenarios/multi_turn_conversation.go (0 hunks)
  • tests/core-providers/scenarios/multiple_images.go (0 hunks)
  • tests/core-providers/scenarios/reasoning.go (0 hunks)
  • tests/core-providers/scenarios/response_validation.go (10 hunks)
  • tests/core-providers/scenarios/responses_stream.go (6 hunks)
  • tests/core-providers/scenarios/speech_synthesis.go (3 hunks)
  • tests/core-providers/scenarios/speech_synthesis_stream.go (3 hunks)
  • tests/core-providers/scenarios/test_retry_framework.go (28 hunks)
  • tests/core-providers/scenarios/text_completion.go (1 hunks)
  • tests/core-providers/scenarios/text_completion_stream.go (4 hunks)
  • tests/core-providers/scenarios/tool_calls_streaming.go (4 hunks)
  • tests/core-providers/scenarios/transcription.go (7 hunks)
  • tests/core-providers/scenarios/transcription_stream.go (4 hunks)
  • tests/core-providers/scenarios/utils.go (1 hunks)
  • tests/core-providers/scenarios/validation_presets.go (2 hunks)
  • tests/core-providers/tests.go (1 hunks)
  • transports/bifrost-http/lib/config.go (13 hunks)
💤 Files with no reviewable changes (7)
  • tests/core-providers/scenarios/multi_turn_conversation.go
  • tests/core-providers/scenarios/image_url.go
  • tests/core-providers/scenarios/end_to_end_tool_calling.go
  • tests/core-providers/scenarios/multiple_images.go
  • tests/core-providers/scenarios/image_base64.go
  • tests/core-providers/scenarios/reasoning.go
  • tests/core-providers/scenarios/complete_end_to_end.go
🧰 Additional context used
🧬 Code graph analysis (9)
tests/core-providers/scenarios/speech_synthesis.go (4)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • GetTestRetryConfigForScenario (1118-1150)
  • TestRetryContext (170-175)
  • SpeechRetryConfig (218-225)
  • SpeechRetryCondition (146-149)
  • WithSpeechTestRetry (1328-1478)
core/schemas/speech.go (1)
  • BifrostSpeechResponse (22-29)
transports/bifrost-http/handlers/inference.go (1)
  • SpeechRequest (206-210)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/chat_completion_stream.go (2)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
  • WithStreamRetry (582-690)
core/schemas/bifrost.go (3)
  • BifrostStream (318-325)
  • BifrostError (353-362)
  • ChatCompletionStreamRequest (90-90)
plugins/mocker/main.go (1)
core/providers/gemini/types.go (1)
  • Content (871-879)
transports/bifrost-http/lib/config.go (2)
core/schemas/mcp.go (1)
  • MCPConfig (11-13)
framework/configstore/clientconfig.go (1)
  • GovernanceConfig (67-74)
tests/core-providers/scenarios/utils.go (5)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • DefaultSpeechRetryConfig (860-873)
  • TestRetryContext (170-175)
  • SpeechRetryConfig (218-225)
  • SpeechRetryCondition (146-149)
  • WithSpeechTestRetry (1328-1478)
tests/core-providers/scenarios/validation_presets.go (2)
  • SpeechExpectations (148-164)
  • ModifyExpectationsForProvider (301-348)
core/schemas/speech.go (1)
  • BifrostSpeechResponse (22-29)
core/schemas/bifrost.go (2)
  • BifrostError (353-362)
  • SpeechRequest (94-94)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/speech_synthesis_stream.go (1)
tests/core-providers/scenarios/test_retry_framework.go (2)
  • WithSpeechStreamValidationRetry (1990-2128)
  • SpeechStreamValidationResult (1980-1986)
tests/core-providers/scenarios/response_validation.go (4)
core/schemas/models.go (1)
  • BifrostListModelsResponse (36-45)
core/schemas/bifrost.go (3)
  • BifrostError (353-362)
  • RequestType (83-83)
  • ListModelsRequest (86-86)
tests/core-providers/scenarios/error_parser.go (3)
  • ParseBifrostError (47-104)
  • FormatErrorConcise (353-356)
  • LogError (359-366)
plugins/mocker/main.go (1)
  • Latency (126-130)
tests/core-providers/scenarios/test_retry_framework.go (11)
core/schemas/models.go (1)
  • BifrostListModelsResponse (36-45)
core/schemas/bifrost.go (3)
  • BifrostError (353-362)
  • ErrorField (371-378)
  • BifrostStream (318-325)
tests/core-providers/scenarios/response_validation.go (6)
  • ResponseExpectations (18-43)
  • ValidateChatResponse (68-111)
  • ValidateTextCompletionResponse (114-154)
  • ValidateSpeechResponse (203-237)
  • ValidateEmbeddingResponse (314-348)
  • ValidateTranscriptionResponse (240-274)
core/schemas/chatcompletions.go (1)
  • BifrostChatResponse (25-40)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/test_retry_conditions.go (3)
  • StreamErrorCondition (537-537)
  • IncompleteStreamCondition (582-582)
  • EmptySpeechCondition (667-667)
core/schemas/textcompletions.go (1)
  • BifrostTextCompletionResponse (69-77)
core/schemas/speech.go (1)
  • BifrostSpeechResponse (22-29)
core/schemas/embedding.go (1)
  • BifrostEmbeddingResponse (22-28)
core/schemas/transcriptions.go (1)
  • BifrostTranscriptionResponse (16-26)
tests/core-providers/scenarios/error_parser.go (2)
  • ParseBifrostError (47-104)
  • FormatErrorConcise (353-356)
tests/core-providers/scenarios/responses_stream.go (5)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
  • WithStreamRetry (582-690)
core/schemas/bifrost.go (2)
  • BifrostStream (318-325)
  • BifrostError (353-362)
ui/lib/types/logs.ts (1)
  • BifrostError (223-229)
core/schemas/responses.go (6)
  • ResponsesStreamResponseType (1351-1351)
  • BifrostResponsesStreamResponse (1421-1459)
  • ResponsesStreamResponseTypeOutputTextDelta (1369-1369)
  • ResponsesStreamResponseTypeOutputItemAdded (1363-1363)
  • ResponsesStreamResponseTypeContentPartAdded (1366-1366)
  • ResponsesStreamResponseTypeError (1418-1418)
tests/core-providers/scenarios/validation_presets.go (2)
  • GetExpectationsForScenario (209-294)
  • ModifyExpectationsForProvider (301-348)
🔇 Additional comments (28)
transports/bifrost-http/lib/config.go (2)

1615-1632: LGTM! Persistence-first order improves consistency.

The reversal of operations (store update before in-memory update) ensures that the in-memory Providers map only reflects successfully persisted state. If ConfigStore.AddProvider fails, the early return prevents the in-memory update, maintaining consistency.

Note: If UpdateEnvKeys fails (line 1623), a warning is logged but execution continues to update the in-memory map (line 1631). This creates a minor inconsistency where EnvKeys in the store may be stale while Providers is updated, though this appears intentional given the warning-level handling.


1615-1632: Clarify relationship to PR objectives.

The PR is titled "feat: core tests retry logic enhancements" and the AI summary mentions defensive nil checks and retry frameworks for tests. However, this file contains persistence order changes in AddProvider with no visible retry logic, nil checks, or test-related modifications.

Please clarify whether this file's changes are:

  • Part of a broader refactor unrelated to the PR's stated objectives, or
  • Accidentally included in this PR, or
  • Related in a way not evident from the diff
framework/streaming/accumulator.go (1)

268-296: LGTM! Robust nil guards prevent panics.

The added nil checks for message.Content before accessing ContentStr and ContentBlocks are defensive and prevent potential nil pointer dereferences during streaming accumulation. The initialization logic (lines 290-292) ensures Content is non-nil before assignment, maintaining safe content assembly.

core/providers/gemini/utils.go (2)

416-431: LGTM! Nil guards align with broader safety improvements.

The nil checks for message.Content before accessing ContentStr and ContentBlocks prevent potential nil dereferences during Bifrost-to-Gemini message conversion. This aligns with the nil-safety pattern applied across framework/streaming/accumulator.go and plugins/mocker/main.go.


477-493: LGTM! Safe content extraction for tool responses.

The nil guard for message.Content before extracting content for tool responses is consistent with the PR-wide nil-safety improvements. The fallback logic assembles text from ContentBlocks when present, maintaining safe access patterns.

plugins/mocker/main.go (1)

647-723: LGTM! Nil guards in content extraction prevent crashes.

The nil checks for message.Content before accessing ContentStr in both single-message (line 671) and multi-message (line 680) paths, as well as in ResponsesRequest flows (lines 698-716), are defensive and prevent potential nil dereferences during fast content extraction for mocking rules. This aligns with the nil-safety improvements in framework/streaming/accumulator.go and core/providers/gemini/utils.go.

tests/core-providers/tests.go (1)

66-118: LGTM! Consistent status formatting for test summaries.

The shift from emoji-based prefixes to bracketed status indicators ([ENABLED]/[SKIPPED]/[TOTAL]) improves consistency and machine-readability in test output without altering functionality.

tests/core-providers/scenarios/error_parser.go (1)

378-399: LGTM! Consistent error prefix formatting.

The update ensures all RequireNoError failures include a ❌ prefix for visual consistency with retry-enabled test paths. The defensive check (line 394) prevents duplicate prefixes if already present.

tests/core-providers/scenarios/transcription_stream.go (3)

84-116: LGTM! Retry framework improves TTS generation reliability.

The integration of WithSpeechTestRetry with a structured retry config and context adds resilience to TTS generation for transcription tests. The expectations are dynamically adjusted per provider via ModifyExpectationsForProvider, which is appropriate for handling provider-specific behavior.


148-166: LGTM! Streaming retry wrapper enhances test stability.

The WithStreamRetry wrapper for TranscriptionStreamRequest provides structured retry logic with clear context metadata (scenario name, expected behavior, provider/model info), improving resilience against transient streaming failures.


264-266: Verify latency validation relaxation doesn't mask issues.

The shift from implicit fatal assertion to a warning when lastTokenLatency == 0 (also at lines 500, 595) reduces test strictness. While this may prevent flaky failures if zero latency is legitimate for some providers or inter-chunk responses, it could also hide scenarios where latency reporting is broken.

Confirm that zero latency is expected/acceptable in streaming transcription chunks for the providers under test, or consider logging provider context when this warning triggers to aid debugging:

 if lastTokenLatency == 0 {
-    t.Logf("⚠️ Warning: Last token latency is 0")
+    t.Logf("⚠️ Warning: Last token latency is 0 (provider: %s, model: %s)", testConfig.Provider, testConfig.TranscriptionModel)
 }
tests/core-providers/scenarios/chat_completion_stream.go (3)

42-60: LGTM! Retry wrapper improves streaming test reliability.

The integration of WithStreamRetry for chat completion streaming with a detailed retry context (scenario name, expected behavior, metadata) adds structured resilience to handle transient streaming failures.


186-193: Verify content-length validation removal doesn't reduce coverage.

The removal of explicit MinContentLength and MaxContentLength expectations relaxes validation strictness. While this may reduce flaky test failures caused by variable response lengths, it also eliminates bounds checking that could catch truncated or bloated responses.

Confirm this relaxation aligns with the test's intent. If length bounds are still important, consider adding them back as warnings rather than hard failures, or verify that other validations (e.g., ShouldContainAnyOf for "paris") provide sufficient coverage:

// Option: Add length as a warning-level check if needed
if len(finalContent) < 50 || len(finalContent) > 1000 {
    t.Logf("⚠️ Content length outside expected range: %d chars", len(finalContent))
}

241-259: LGTM! Retry framework for streaming with tools.

The retry wrapper for streaming with tool calls includes appropriate expectations (should_have_tool_calls, tool_name: "get_weather") and metadata, enhancing test resilience while maintaining validation rigor.

tests/core-providers/scenarios/text_completion_stream.go (4)

52-70: LGTM! Retry framework for text completion streaming.

The WithStreamRetry wrapper for TextCompletionStreamRequest with a structured retry config and context aligns with the retry framework applied across other streaming test scenarios (chat completion, transcription), improving test stability.


181-187: Verify latency validation relaxation aligns with expected behavior.

Similar to the transcription streaming tests, the shift from an implicit fatal assertion to a warning when Latency <= 0 (line 182-183) reduces strictness. Confirm this is intentional and that zero/negative latency is acceptable for inter-chunk responses or specific providers.

Consider logging provider and model context when this warning triggers to aid debugging:

 if lastResponse.BifrostTextCompletionResponse.ExtraFields.Latency <= 0 {
-    t.Logf("⚠️ Warning: Last streaming chunk missing latency information (got %d ms)", lastResponse.BifrostTextCompletionResponse.ExtraFields.Latency)
+    t.Logf("⚠️ Warning: Last streaming chunk missing latency information (got %d ms, provider: %s, model: %s)", 
+        lastResponse.BifrostTextCompletionResponse.ExtraFields.Latency, testConfig.Provider, model)
 }

253-269: LGTM! Dynamic retry contexts for varied prompts.

The use of fmt.Sprintf to create scenario-specific retry context names (e.g., TextCompletionStreamVariedPrompts_<name>) provides clear traceability in retry logs. The retry wrapper integration is consistent and appropriate.


395-411: LGTM! Retry framework for parameterized streaming tests.

The retry wrapper for parameter tests with dynamic scenario naming maintains the structured retry pattern applied across the test suite, enhancing resilience while preserving test intent.

tests/core-providers/scenarios/list_models.go (2)

31-77: ListModels retry integration looks correct and robust

The new retryConfig/context, expectations, and ListModelsRetryConfig wiring into WithListModelsTestRetry are coherent, and the post-retry checks (error, nil response, empty data) align with the expectations (min_model_count: 1, latency presence). No correctness issues spotted here.

Also applies to: 81-81


131-167: Pagination + next-page retry flow is consistent with base ListModels

The pagination test’s retry setup (including min_model_count: 0 for sparse pages) and reuse of listModelsRetryConfig for the next-page fetch are consistent with the base ListModels path and should make this test resilient to transient issues while still enforcing page-size and structural checks.

Also applies to: 187-209

tests/core-providers/scenarios/speech_synthesis_stream.go (1)

218-220: Treating zero last-token latency as a warning is appropriate

Downgrading lastTokenLatency == 0 to a warning log (instead of a hard failure) keeps tests from flaking on providers that omit or zero-fill per-chunk latency while still surfacing the anomaly.

Also applies to: 343-345

tests/core-providers/scenarios/response_validation.go (3)

46-53: ToolCallExpectation struct looks consistent with downstream validation

The fields here line up with how validateSingleToolCall inspects required/forbidden args, types, and values, and ValidateArgsJSON gates JSON parsing correctly. No issues from a correctness standpoint.


385-449: Converting content expectations from hard failures to warnings is appropriate for LLM nondeterminism

The shift from errors to warnings for keyword presence, forbidden words, and pattern matching in chat/text/responses content validation keeps structural failures as true test blockers while avoiding flakes from natural output variance. The added context (truncated content in messages) is helpful for debugging and the behavior is consistent across all three validators.

Also applies to: 571-636, 715-781


1348-1369: Error logging now consistently prefixes failures with ❌

Ensuring each individual validation error line gets a prefix (without double-prefixing) makes logs much easier to scan, and matches the rest of the retry/error formatting in this PR. Implementation is straightforward and side‑effect free.

tests/core-providers/scenarios/test_retry_framework.go (4)

16-40: High-level retry framework docs are clear and match implementation

The top-of-file commentary (validation-first, retries on validation failures and timeouts, conditions as secondary) matches how the various With...TestRetry helpers now behave. This will be useful for future contributors navigating a fairly involved test harness.


275-425: Chat and Responses retry wrappers correctly prioritize validation failures and timeouts

WithChatTestRetry and WithResponsesTestRetry:

  • Validate responses first and immediately return on success.
  • Always retry when there are validation errors (regardless of conditions), matching the documented strategy.
  • Give timeout errors precedence via isTimeoutError, and prefix timeout and validation retry reasons with ❌ for clear logs.
  • On exhausted attempts, synthesize a BifrostError with status 400 and a clearly marked ❌ Validation failed … message to force test failure, and ensure OnFinalFail messages also carry the prefix.

The loop bounds use attempt <= config.MaxAttempts and always break/return, so there’s no risk of infinite retry. This looks correct and consistent with the framework goals.

Also applies to: 428-579


581-707: Streaming retry helper now has clear timeout handling and consistent error formatting

WithStreamRetry:

  • Always retries on timeout errors (via isTimeoutError) irrespective of other conditions.
  • Falls back to condition-based retries and, if none match, still retries on any streaming error, which is reasonable given the flakiness of streams.
  • Adds ❌ prefixes to both first-attempt error logs and final-failure logs, keeping output consistent with non-stream paths.

Given MaxAttempts is bounded and backoff is capped, this is safe from a retry-storm perspective.


1980-2287: Streaming validation retry wrappers add useful end-to-end coverage for speech and responses

WithSpeechStreamValidationRetry and WithResponsesStreamValidationRetry:

  • Treat stream acquisition errors similarly to WithStreamRetry (timeouts always retried, condition-based retries next, then a “retry on any error” fallback for streaming).
  • Add another retry layer around content validation, using validateStream to consume the channel and decide whether to retry based on errors / stream errors.
  • Aggregate all errors and ensure final OnFinalFail and fallback logs are prefixed with ❌, which keeps streaming diagnostics consistent with non-stream tests.

The control flow always terminates (bounded by MaxAttempts), and the separation of concerns (operation vs. validateStream) is clear. No functional issues here.

@TejasGhatte TejasGhatte mentioned this pull request Nov 24, 2025
15 tasks
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 11-24-feat_core_tests_retry_logic_enhancements branch 3 times, most recently from 1c9c6fe to 2b3d485 Compare November 24, 2025 09:02
@coderabbitai coderabbitai Bot requested a review from TejasGhatte November 24, 2025 09:09
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/core-providers/scenarios/response_validation.go (1)

46-53: Re‑gating JSON parsing in validateSingleToolCall can break existing expectations if ValidateArgsJSON is left false

With ValidateArgsJSON added, validateSingleToolCall only populates args when expected.ValidateArgsJSON is true, but the required/forbidden/type/value checks run unconditionally against args:

  • When ValidateArgsJSON is false (the zero value for existing presets), args stays nil.
  • All RequiredArgs will then be treated as missing, and other checks will silently see no values, changing semantics from the prior behavior where arguments were always parsed when possible.

If older ToolCallExpectation values weren’t updated to set ValidateArgsJSON = true, you’ll see spurious “missing required argument” / type mismatch errors even though the underlying arguments are unchanged.

Consider one of the following to avoid surprises:

func validateSingleToolCall(arguments interface{}, expected ToolCallExpectation, choiceIdx, callIdx int, result *ValidationResult) {
     var args map[string]interface{}

     if expected.ValidateArgsJSON {
         // current JSON parsing logic...
     }
+
+    // If we never parsed args and we’re not enforcing JSON validity, skip argument-level checks.
+    if args == nil && !expected.ValidateArgsJSON {
+        return
+    }

or, alternatively, always attempt to coerce arguments into a map[string]interface{} (maps, JSON strings, etc.) but only treat JSON parse failures as hard errors when ValidateArgsJSON is true.

Also applies to: 1256-1345

tests/core-providers/scenarios/embedding.go (1)

135-144: for i := range responseDataLength requires Go 1.22+; an index-based loop is more portable

responseDataLength is an int, so:

for i := range responseDataLength {

only compiles on Go 1.22 or newer, where range over an integer was introduced; on older toolchains it fails with cannot range over responseDataLength (variable of type int), and embedding.go won’t build.

You already enforce responseDataLength == len(testTexts) above, so a classic index loop works on any supported Go version:

- for i := range responseDataLength {
+ for i := 0; i < responseDataLength; i++ {
    vec, extractErr := getEmbeddingVector(response.Data[i])
    if extractErr != nil {
        t.Fatalf("Failed to extract embedding vector for text '%s': %v", testTexts[i], extractErr)
    }
    if len(vec) == 0 {
        t.Fatalf("Embedding vector is empty for text '%s'", testTexts[i])
    }
    embeddings[i] = vec
}
tests/core-providers/scenarios/utils.go (1)

588-612: Add nil safety checks for err.Error field before dereferencing

The function assumes err.Error is always non-nil after checking err != nil, but since Error is a pointer field (*ErrorField), it can be nil for sparse error payloads. This causes a panic on any dereference. Additionally, compare with the implementation in core/utils.go (line 239), which properly guards with if err.Error != nil checks. Apply similar defensive checks at lines 598, 601, and 604. Also trim leading spaces from the formatted string when both errorType and errorCode are empty to avoid output like " : message".

func GetErrorMessage(err *schemas.BifrostError) string {
	if err == nil {
		return ""
	}

	errorType := ""
	if err.Type != nil && *err.Type != "" {
		errorType = *err.Type
	}

	if errorType == "" && err.Error != nil && err.Error.Type != nil && *err.Error.Type != "" {
		errorType = *err.Error.Type
	}

	errorCode := ""
	if err.Error != nil && err.Error.Code != nil && *err.Error.Code != "" {
		errorCode = *err.Error.Code
	}

	errorMessage := ""
	if err.Error != nil {
		errorMessage = err.Error.Message
	}

	errorString := fmt.Sprintf("%s %s: %s", errorType, errorCode, errorMessage)
	return strings.TrimSpace(errorString)
}
🧹 Nitpick comments (17)
core/providers/bedrock/bedrock.go (1)

909-927: Optimize latency assignment to avoid redundant write.

The logic correctly assigns total streaming latency to the final response. However, for the last response, the Latency field is set twice: first at line 916 with time.Since(lastChunkTime).Milliseconds(), then overwritten at line 926 with time.Since(startTime).Milliseconds().

Consider this optimization:

 for i, finalResponse := range finalResponses {
 	finalResponse.ExtraFields = schemas.BifrostResponseExtraFields{
 		RequestType:     schemas.ResponsesStreamRequest,
 		Provider:        providerName,
 		ModelRequested:  request.Model,
 		ModelDeployment: deployment,
 		ChunkIndex:      chunkIndex,
-		Latency:         time.Since(lastChunkTime).Milliseconds(),
 	}
 	chunkIndex++
-	lastChunkTime = time.Now()

 	if providerUtils.ShouldSendBackRawResponse(ctx, provider.sendBackRawResponse) {
 		finalResponse.ExtraFields.RawResponse = "{}" // Final event has no payload
 	}

 	if i == len(finalResponses)-1 {
 		finalResponse.ExtraFields.Latency = time.Since(startTime).Milliseconds()
+	} else {
+		finalResponse.ExtraFields.Latency = time.Since(lastChunkTime).Milliseconds()
 	}
+	lastChunkTime = time.Now()

 	providerUtils.ProcessAndSendResponse(ctx, postHookRunner, providerUtils.GetBifrostResponseForStreamResponse(nil, nil, finalResponse, nil, nil), responseChan)
 }

Note that lastChunkTime = time.Now() must move below the conditional; otherwise the else branch would measure from a timestamp reset earlier in the same iteration and always report ~0 ms.
tests/core-providers/scenarios/test_retry_framework.go (1)

940-957: List‑models retry config has an unused conditions hook; consider either wiring it or simplifying for clarity

You’ve introduced:

  • ListModelsRetryCondition / ListModelsRetryConfig,
  • DefaultListModelsRetryConfig() returning a TestRetryConfig,
  • WithListModelsTestRetry that unconditionally retries on any failure (errors, nil response, empty data, validation failures), and never calls checkListModelsRetryConditions.

Given that WithListModelsTestRetry intentionally ignores conditions, the Conditions []ListModelsRetryCondition field and checkListModelsRetryConditions helper are currently unused. That’s not incorrect, but it adds surface area without behavior.

Either:

  • keep the current “retry on any failure” behavior, but drop the unused Conditions/checkListModelsRetryConditions bits for simplicity, or
  • explicitly call checkListModelsRetryConditions inside WithListModelsTestRetry if you plan to introduce more nuanced list‑models retry rules later.

Also applies to: 1843-1977

tests/core-providers/scenarios/error_parser.go (1)

379-397: Clarify ❌ handling to truly enforce a prefix

The new comment says “ALWAYS includes ❌ prefix”, but the check uses strings.Contains, so messages that contain ❌ in the middle won’t be prefixed while still not actually starting with it. If you really want a prefix, consider strings.HasPrefix(message, "❌") (or trimming leading whitespace first), and optionally mirror this behavior in AssertNoError for consistency.
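A minimal sketch of the prefix-based check (the helper name and whitespace-trimming behavior are assumptions for illustration, not the repo's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// hasErrorPrefix reports whether msg starts with the ❌ marker,
// ignoring leading whitespace. Unlike strings.Contains, a marker
// buried mid-message does not count.
func hasErrorPrefix(msg string) bool {
	return strings.HasPrefix(strings.TrimSpace(msg), "❌")
}

func main() {
	fmt.Println(hasErrorPrefix("❌ request failed"))        // true
	fmt.Println(hasErrorPrefix("  ❌ request failed"))      // true: leading space trimmed
	fmt.Println(hasErrorPrefix("request failed ❌ midway")) // false: not a prefix
}
```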

tests/core-providers/scenarios/transcription_stream.go (1)

84-114: Retry‑wrapped TTS generation looks good; consider reusing more config

The TTS retry wiring via WithSpeechTestRetry is sound and aligns with the new framework. Two small improvements you might consider:

  • Tie SpeechExpectations' min_audio_bytes to the per‑case expectation (e.g., derive from tc.expectMinBytes) instead of hard‑coding 100, to keep validation strength consistent with the main speech tests.
  • The TestRetryConfig → SpeechRetryConfig mapping pattern (copying max/base/max delay, hooks, empty conditions) is repeated in several places; a small helper to construct SpeechRetryConfig from a TestRetryConfig would reduce duplication and keep future changes localized.
tests/core-providers/scenarios/utils.go (1)

536-569: Solid TTS retry integration; minor reuse opportunities

The helper now correctly delegates to WithSpeechTestRetry and enforces non‑empty audio before writing the temp file, which is a nice robustness boost. Given the same TestRetryConfig → SpeechRetryConfig conversion and SpeechExpectations(100) pattern appear in multiple places, consider extracting a small shared helper (e.g., NewSpeechRetryConfigFromTestConfig) and, if you ever want a stricter minimum size here, taking minBytes as a parameter instead of fixing it at 100.

tests/core-providers/scenarios/chat_completion_stream.go (2)

43-65: ChatCompletionStream retry integration looks good; watch “paris” sensitivity

Using StreamingRetryConfig + WithStreamRetry for the base stream path is a clean upgrade, and the consolidation into a synthetic BifrostChatResponse for validation is preserved. The only thing to double‑check is the new expectations.ShouldContainAnyOf entry of "paris": if ValidateChatResponse does a case‑sensitive check, completions that say "Paris" (capital P) could spuriously fail. If that isn’t already handled, consider normalizing case before comparison or adding "Paris" alongside "paris".

Also applies to: 188-210
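The case-normalization suggested above can be sketched as a small standalone check (the helper name is illustrative, not the repo's actual validator):

```go
package main

import (
	"fmt"
	"strings"
)

// containsAnyFold reports whether content contains any expected phrase,
// case-insensitively, so "Paris" satisfies an expectation of "paris".
func containsAnyFold(content string, anyOf []string) bool {
	lower := strings.ToLower(content)
	for _, want := range anyOf {
		if strings.Contains(lower, strings.ToLower(want)) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsAnyFold("The capital of France is Paris.", []string{"paris"})) // true
	fmt.Println(containsAnyFold("The capital of Italy is Rome.", []string{"paris"}))   // false
}
```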


242-333: With‑tools streaming retry path is solid; optionally surface structured errors

The with‑tools streaming scenario now correctly wraps ChatCompletionStreamRequest in WithStreamRetry and gives clear “after retries” failures with a ❌ prefix and GetErrorMessage, which is a nice improvement. One small enhancement to consider: when response == nil || response.BifrostChatResponse == nil, you could log any attached response.BifrostError (if present) via FormatErrorConcise(ParseBifrostError(...)) before t.Fatalf, which would make debugging provider‑side failures during streaming with tools much easier.

tests/core-providers/scenarios/speech_synthesis.go (2)

85-121: Base speech synthesis retry flow is correct; centralize SpeechRetryConfig construction

The base SpeechSynthesis subtests now correctly route through WithSpeechTestRetry with per‑scenario TestRetryConfig, expectations, and informative “after retries” failures, while validateSpeechSynthesisSpecific still enforces audio size/model checks. Given the same SpeechRetryConfig construction pattern appears here, in the HD test, voice tests, and GenerateTTSAudioForTest, it would be worth extracting a small helper (e.g., NewSpeechRetryConfig(retryConfig TestRetryConfig)) to avoid repetition and keep retry behavior consistent across all speech scenarios.


192-306: Advanced HD and voice‑type speech tests: good retry usage; consider honoring caller context

The HD long‑text and AllVoiceOptions subtests both use GetTestRetryConfigForScenario + WithSpeechTestRetry correctly and add targeted validations (min bytes, model consistency, per‑voice audio checks), which strengthens these paths. The only behavioral nit is that each block creates a fresh context.Background() (requestCtx) instead of reusing the ctx passed into RunSpeechSynthesisAdvancedTest, so timeouts/cancellations from the higher‑level test harness won’t propagate into the provider calls; if that context is meaningful in your setup, consider threading it through rather than dropping it.

tests/core-providers/scenarios/text_completion.go (1)

52-76: Text completion retry wrapper is wired correctly

The TextCompletion test now cleanly delegates to WithTextCompletionTestRetry with a dedicated TextCompletionRetryConfig, contextual metadata, and scenario expectations, which matches the broader retry framework. Given the conditions slice is currently empty, you can keep it as‑is, or later introduce TextCompletionRetryConditions (e.g., for empty/degenerate outputs) without changing call sites.

tests/core-providers/scenarios/transcription.go (1)

142-177: Unify repeated transcription retry + error-handling pattern

All these blocks follow the same structure (build TranscriptionRetryConfig + TestRetryContext + expectations, call WithTranscriptionTestRetry, then wrap the error with a “❌” prefix and do nil/empty checks). That’s good for consistency but fairly repetitive.

Consider extracting a small helper (e.g. runTranscriptionWithRetry(t, client, ctx, scenarioName, request, expectations, testConfig)) that encapsulates:

  • retry config/context construction,
  • WithTranscriptionTestRetry call, and
  • the common error / nil / empty-text handling.

That would reduce boilerplate and make future tweaks to retry semantics or error formatting one-liner changes.

Also applies to: 231-269, 313-352, 383-421, 453-492

tests/core-providers/scenarios/tool_calls_streaming.go (3)

17-41: Accumulator design is solid; minor cleanup opportunity in API surface

The accumulator logic for both chat and responses streaming:

  • sensibly prefers stable keys (ID, then itemID, then name/index),
  • handles incremental vs. “done” full-JSON arguments,
  • and keeps ItemIDToKey in sync when IDs show up later.

Two minor points:

  • AccumulateChatToolCall’s choiceIndex parameter is never used; either remove it or incorporate it into the keying strategy if you expect multiple choices with overlapping toolCall.Index values.
  • The JSON “complete vs. incremental” heuristic in AccumulateResponsesToolCall (replace when argsStr looks like {...} and we already have partials) is reasonable, but consider documenting that this assumes the “done” event sends a full object and that arguments are always object-shaped (not arrays), to make the intent explicit.

Also applies to: 43-92, 94-179, 181-209
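The "complete vs. incremental" heuristic described above can be sketched like this (assuming, per the review, that a "done" event carries one full object-shaped JSON payload):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// looksLikeCompleteObject reports whether argsStr appears to be one
// full JSON object (a "done" event) rather than an incremental delta
// fragment that must be concatenated with earlier partials.
func looksLikeCompleteObject(argsStr string) bool {
	s := strings.TrimSpace(argsStr)
	if !strings.HasPrefix(s, "{") || !strings.HasSuffix(s, "}") {
		return false
	}
	return json.Valid([]byte(s))
}

func main() {
	fmt.Println(looksLikeCompleteObject(`{"location":"Paris"}`)) // true: full object
	fmt.Println(looksLikeCompleteObject(`{"loca`))               // false: partial delta
}
```

Note the baked-in assumption the review calls out: array-shaped arguments would never match, so documenting (or asserting) object-shaped arguments keeps the heuristic honest.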


356-663: Responses tool-calls streaming handling is thorough; consider slight simplification

The Responses streaming test:

  • covers all relevant event types (FunctionCallArgumentsDelta, OutputItemAdded, FunctionCallArgumentsDone),
  • threads through callID, name, and itemID where available,
  • and leverages the accumulator to reconcile partial deltas with final arguments.

A couple of non-blocking thoughts:

  • In FunctionCallArgumentsDelta, the “arguments” source selection (Delta vs. Arguments) and verbose logging are great; you might eventually want to reuse a small helper to avoid repeating the callID/name/itemID probing logic across the three cases.
  • Given that GetFinalResponsesToolCalls and validateStreamingToolCalls already enforce that ID/Name/Arguments are present, the explicit per-tool-call if toolCall.ID == "" || ... block is redundant and could be dropped to keep the flow slimmer.

Functionally this looks correct and aligns with the new accumulator.


666-714: validateStreamingToolCalls behavior and JSON validation semantics

The helper gives strong guarantees (all tool calls must have ID, name, and non-empty arguments) and then attempts to parse arguments as JSON, treating failures as warnings as long as the string isn’t blank. That’s a reasonable compromise for mixed/partial streaming formats.

Given that callers already ensure non-empty arguments before invoking this helper, you could rely solely on the json.Unmarshal + TrimSpace != "" logic here and avoid some duplication with upstream checks, but it’s not strictly necessary.
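The fail-vs-warn compromise described above, as a standalone sketch (illustrative only; the real helper logs via testing.T rather than returning flags):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// validateToolArgs sketches the helper's compromise: blank arguments
// are a hard failure, while non-blank strings that fail to parse as
// JSON are only a warning.
func validateToolArgs(args string) (fail bool, warn bool) {
	if strings.TrimSpace(args) == "" {
		return true, false
	}
	var v any
	if err := json.Unmarshal([]byte(args), &v); err != nil {
		return false, true
	}
	return false, false
}

func main() {
	fmt.Println(validateToolArgs("   "))             // true false: blank -> fail
	fmt.Println(validateToolArgs(`{"city":"Oslo"}`)) // false false: valid JSON
	fmt.Println(validateToolArgs(`{"city":`))        // false true: malformed -> warn only
}
```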

tests/core-providers/scenarios/validation_presets.go (1)

208-293: GetExpectationsForScenario("ResponsesStream", ...) currently falls back to default and is unused

GetExpectationsForScenario doesn’t define a "ResponsesStream" case, so the "ResponsesStream" call in responses_stream.go currently just returns BasicChatExpectations, later modified and then never actually used in that file.

If you intend to drive validations off scenario-specific expectations for responses streaming, consider either:

  • adding a "ResponsesStream" case that returns StreamingExpectations(), or
  • reusing the existing "ChatCompletionStream"/StreamingExpectations path and passing that scenario name from responses_stream.go.

If expectations won’t be used for responses streaming anymore, you might instead remove the dead expectations variable there to avoid confusion.

tests/core-providers/scenarios/responses_stream.go (2)

48-252: Validation-driven retry wrapper is well-structured; a couple of small polish points

The move to WithResponsesStreamValidationRetry with a dedicated validation closure is a nice improvement:

  • The closure covers core failure modes (nil chunk, explicit error events, timeout, excessive chunks) and returns structured errors instead of calling t.Fatal in most places.
  • Structural validation is separated (validateResponsesStreamingStructure + validateResponsesStreamingResponse), and the outer test only inspects the final ResponsesStreamValidationResult, which aligns with the retry framework’s design.

Two minor suggestions:

  • The expectations := GetExpectationsForScenario("ResponsesStream", ...) block is currently unused; either wire those expectations into validation (e.g., content checks) or drop the block to avoid dead code.
  • You already verify latency on the last chunk inside validateResponsesStreamingResponse; populating validationResult.LastLatency from lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency (instead of 0) would make the returned result more informative for logging/metrics.

Functionally this looks sound.


815-819: validateResponsesStreamingResponse enforces strong but reasonable streaming guarantees

The new helper:

  • enforces non-empty, “substantial” final content,
  • requires that some streaming events and specifically content-related events were seen,
  • checks sequence-number monotonicity and last-response structure/provider,
  • warns (but doesn’t fail) on missing created/completed or missing story-like tokens,
  • and now treats missing latency on the last chunk as a hard error.

This is a good encapsulation of streaming correctness. Just be aware that the latency requirement is now a strict pass/fail gate for all providers using this path; if any provider legitimately omits latency, you’ll need to either relax this based on provider or route those providers through a different expectation preset.

Would you like me to suggest a small adjustment (e.g., making the latency check conditional on provider capabilities or a flag in testConfig) if you find providers that don’t currently emit latency?

Also applies to: 821-917
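One way the suggested adjustment could look: gate the hard failure on a provider-capability flag (the flag and function are hypothetical, shown only to illustrate the shape of the relaxation):

```go
package main

import "fmt"

// checkLastChunkLatency makes missing latency on the final chunk a hard
// error only when the provider (or a flag in testConfig) declares that
// it emits latency; otherwise it downgrades to a warning.
func checkLastChunkLatency(latencyMs int64, providerEmitsLatency bool) error {
	if latencyMs > 0 {
		return nil
	}
	if providerEmitsLatency {
		return fmt.Errorf("❌ last chunk missing latency")
	}
	fmt.Println("⚠️ provider does not emit latency; skipping hard check")
	return nil
}

func main() {
	fmt.Println(checkLastChunkLatency(42, true)) // <nil>: latency present
	fmt.Println(checkLastChunkLatency(0, false)) // <nil>: warning only
	fmt.Println(checkLastChunkLatency(0, true))  // error: provider should have emitted latency
}
```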

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8661443 and 2b3d485.

📒 Files selected for processing (31)
  • core/providers/bedrock/bedrock.go (2 hunks)
  • core/providers/gemini/utils.go (2 hunks)
  • core/providers/openai/openai.go (1 hunks)
  • framework/streaming/accumulator.go (3 hunks)
  • framework/streaming/responses.go (0 hunks)
  • plugins/mocker/main.go (3 hunks)
  • tests/core-providers/scenarios/chat_completion_stream.go (5 hunks)
  • tests/core-providers/scenarios/complete_end_to_end.go (0 hunks)
  • tests/core-providers/scenarios/embedding.go (1 hunks)
  • tests/core-providers/scenarios/end_to_end_tool_calling.go (0 hunks)
  • tests/core-providers/scenarios/error_parser.go (2 hunks)
  • tests/core-providers/scenarios/image_base64.go (0 hunks)
  • tests/core-providers/scenarios/image_url.go (0 hunks)
  • tests/core-providers/scenarios/list_models.go (3 hunks)
  • tests/core-providers/scenarios/multi_turn_conversation.go (0 hunks)
  • tests/core-providers/scenarios/multiple_images.go (0 hunks)
  • tests/core-providers/scenarios/reasoning.go (0 hunks)
  • tests/core-providers/scenarios/response_validation.go (10 hunks)
  • tests/core-providers/scenarios/responses_stream.go (9 hunks)
  • tests/core-providers/scenarios/speech_synthesis.go (3 hunks)
  • tests/core-providers/scenarios/speech_synthesis_stream.go (1 hunks)
  • tests/core-providers/scenarios/test_retry_framework.go (28 hunks)
  • tests/core-providers/scenarios/text_completion.go (1 hunks)
  • tests/core-providers/scenarios/text_completion_stream.go (3 hunks)
  • tests/core-providers/scenarios/tool_calls_streaming.go (4 hunks)
  • tests/core-providers/scenarios/transcription.go (7 hunks)
  • tests/core-providers/scenarios/transcription_stream.go (1 hunks)
  • tests/core-providers/scenarios/utils.go (1 hunks)
  • tests/core-providers/scenarios/validation_presets.go (2 hunks)
  • tests/core-providers/tests.go (1 hunks)
  • transports/bifrost-http/lib/config.go (13 hunks)
💤 Files with no reviewable changes (8)
  • tests/core-providers/scenarios/reasoning.go
  • framework/streaming/responses.go
  • tests/core-providers/scenarios/multi_turn_conversation.go
  • tests/core-providers/scenarios/complete_end_to_end.go
  • tests/core-providers/scenarios/image_base64.go
  • tests/core-providers/scenarios/end_to_end_tool_calling.go
  • tests/core-providers/scenarios/image_url.go
  • tests/core-providers/scenarios/multiple_images.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • plugins/mocker/main.go
  • tests/core-providers/tests.go
  • framework/streaming/accumulator.go
  • transports/bifrost-http/lib/config.go
  • core/providers/gemini/utils.go
🧰 Additional context used
🧬 Code graph analysis (11)
core/providers/openai/openai.go (1)
core/schemas/responses.go (2)
  • ResponsesStreamResponseTypeCompleted (1359-1359)
  • ResponsesStreamResponseTypeIncomplete (1361-1361)
core/providers/bedrock/bedrock.go (1)
plugins/mocker/main.go (1)
  • Latency (126-130)
tests/core-providers/scenarios/text_completion_stream.go (1)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
  • WithStreamRetry (582-690)
tests/core-providers/scenarios/speech_synthesis.go (4)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • GetTestRetryConfigForScenario (1118-1150)
  • TestRetryContext (170-175)
  • SpeechRetryConfig (218-225)
  • SpeechRetryCondition (146-149)
  • WithSpeechTestRetry (1328-1478)
core/schemas/provider.go (1)
  • Provider (207-234)
core/schemas/speech.go (1)
  • BifrostSpeechResponse (22-29)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/speech_synthesis_stream.go (3)
tests/core-providers/scenarios/test_retry_framework.go (2)
  • WithSpeechStreamValidationRetry (1990-2128)
  • SpeechStreamValidationResult (1980-1986)
core/schemas/bifrost.go (2)
  • BifrostStream (318-325)
  • BifrostError (353-362)
core/schemas/speech.go (1)
  • BifrostSpeechStreamResponse (133-138)
tests/core-providers/scenarios/responses_stream.go (2)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • WithResponsesStreamValidationRetry (2141-2290)
  • ResponsesStreamValidationResult (2131-2137)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
  • WithStreamRetry (582-690)
core/schemas/bifrost.go (3)
  • BifrostStream (318-325)
  • BifrostError (353-362)
  • ResponsesStreamRequest (92-92)
tests/core-providers/scenarios/tool_calls_streaming.go (4)
core/schemas/chatcompletions.go (4)
  • BifrostChatRequest (11-18)
  • ChatParameters (154-183)
  • ChatTool (201-205)
  • BifrostChatResponse (25-40)
tests/core-providers/scenarios/test_retry_framework.go (2)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
core/schemas/bifrost.go (2)
  • BifrostStream (318-325)
  • BifrostError (353-362)
core/schemas/responses.go (1)
  • BifrostResponsesStreamResponse (1421-1459)
tests/core-providers/scenarios/test_retry_framework.go (5)
tests/core-providers/scenarios/response_validation.go (1)
  • ResponseExpectations (18-43)
tests/core-providers/scenarios/utils.go (1)
  • GetErrorMessage (588-612)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/test_retry_conditions.go (4)
  • StreamErrorCondition (537-537)
  • IncompleteStreamCondition (582-582)
  • EmptySpeechCondition (667-667)
  • EmptyResponseCondition (16-16)
tests/core-providers/scenarios/error_parser.go (1)
  • ParseBifrostError (47-104)
tests/core-providers/scenarios/response_validation.go (3)
core/schemas/models.go (1)
  • BifrostListModelsResponse (36-45)
core/schemas/bifrost.go (3)
  • BifrostError (353-362)
  • RequestType (83-83)
  • ListModelsRequest (86-86)
tests/core-providers/scenarios/error_parser.go (3)
  • ParseBifrostError (47-104)
  • FormatErrorConcise (353-356)
  • LogError (359-366)
tests/core-providers/scenarios/utils.go (3)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • DefaultSpeechRetryConfig (860-873)
  • TestRetryContext (170-175)
  • SpeechRetryConfig (218-225)
  • SpeechRetryCondition (146-149)
  • WithSpeechTestRetry (1328-1478)
tests/core-providers/scenarios/validation_presets.go (2)
  • SpeechExpectations (148-164)
  • ModifyExpectationsForProvider (301-348)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/embedding.go (4)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • GetTestRetryConfigForScenario (1118-1150)
  • EmbeddingRetryConfig (238-245)
  • WithEmbeddingTestRetry (1503-1653)
tests/core-providers/scenarios/validation_presets.go (2)
  • EmbeddingExpectations (103-115)
  • ModifyExpectationsForProvider (301-348)
core/schemas/bifrost.go (2)
  • BifrostError (353-362)
  • EmbeddingRequest (93-93)
tests/core-providers/scenarios/utils.go (1)
  • GetErrorMessage (588-612)
🔇 Additional comments (11)
core/providers/bedrock/bedrock.go (1)

909-927: PR metadata inconsistency detected.

The PR title indicates "retry logic enhancements" and references Issue #123 about "Files API Support," but the actual changes modify latency reporting for streaming responses. This mismatch suggests the PR description may be incorrect or incomplete.

Please verify:

  1. Is this the correct PR description for these changes?
  2. Should the PR title reflect the actual latency reporting changes?
  3. Is Issue #123 the correct linked issue, or should it reference a different issue about streaming latency improvements?
core/providers/openai/openai.go (1)

1363-1368: Verify intentional behavior difference and test coverage for Incomplete response handling.

The logic change is sound—Incomplete responses should terminate the stream. However, verification uncovered a potential inconsistency:

  • HandleOpenAIChatCompletionStreaming (line 935) terminates only on Completed
  • HandleOpenAIResponsesStreaming (line 1363) now terminates on Completed OR Incomplete

Please confirm:

  1. Is this intentional architectural difference between the two handlers, or should they behave consistently?
  2. Do tests cover the Incomplete response type terminating the stream at line 1363?
  3. Why is Failed response type defined but not triggering stream termination—should it also end the stream?
tests/core-providers/scenarios/text_completion_stream.go (1)

5-5: Text completion streaming tests now correctly reuse the streaming retry wrapper

fmt import is justified by the new fmt.Sprintf‑based scenario names, and both the varied‑prompts and parameters subtests correctly wrap TextCompletionStreamRequest with WithStreamRetry using StreamingRetryConfig and rich TestRetryContext metadata. This keeps streaming tests consistent with the global retry strategy without changing validation semantics.

Also applies to: 253-270, 395-412

tests/core-providers/scenarios/response_validation.go (1)

402-445: Content keyword/pattern checks becoming warnings + ❌‑prefixing errors is appropriate for flaky LLM outputs

Downgrading ShouldContainKeywords / ShouldContainAnyOf / forbidden words / regex mismatches to warnings (with truncated actual content) for chat, text completion, and responses, while still failing on truly empty content, is a sensible tradeoff for non‑deterministic LLM behavior. The logValidationResults change that ensures each error line gets a prefix keeps logs readable, especially when combined with the new retry framework.

Also applies to: 588-633, 734-777, 1355-1360

tests/core-providers/scenarios/embedding.go (1)

75-101: Embedding test now correctly routes through the generic retry framework

The embedding scenario now:

  • derives a TestRetryConfig from GetTestRetryConfigForScenario("Embedding", ...),
  • maps its attempt/backoff and hooks into an EmbeddingRetryConfig,
  • wraps EmbeddingRequest in WithEmbeddingTestRetry with provider‑aware expectations.

This keeps embedding tests aligned with the rest of the retry/validation framework while still allowing embedding‑specific semantic checks afterward.

Also applies to: 103-109

tests/core-providers/scenarios/test_retry_framework.go (1)

16-41: Retry framework + timeout handling are coherent and consistent across test domains

The refactored retry framework:

  • documents the “validation failures always trigger retries” rule clearly,
  • unifies retry configs per domain (chat/text/responses/speech/transcription/embedding/list‑models),
  • consistently:
    • validates responses first,
    • always retries on timeouts via isTimeoutError,
    • then uses domain‑specific check*RetryConditions only as secondary signals,
    • prefixes validation/timeout messages with ❌ for easier log scanning,
    • applies exponential backoff with sensible caps.

WithChatTestRetry, WithResponsesTestRetry, WithStreamRetry, and the dual‑API wrapper all follow this pattern without introducing obvious logic or concurrency issues. This is a solid foundation for stabilizing flaky LLM tests.

Also applies to: 275-426, 429-579, 581-690, 940-957, 968-1115

tests/core-providers/scenarios/list_models.go (1)

31-52: List‑models tests correctly leverage the new retry helper and keep strong post‑retry assertions

Both RunListModelsTest and RunListModelsPaginationTest now:

  • derive a scenario‑specific TestRetryConfig via GetTestRetryConfigForScenario,
  • map it into a ListModelsRetryConfig (retry‑on‑any‑failure) for WithListModelsTestRetry,
  • pass clear TestRetryContext metadata,
  • enforce structural correctness after retries (non‑nil response, non‑empty data where appropriate, page size constraints) and reasonable latency bounds,
  • surface final failures with explicit ❌ messages using GetErrorMessage.

Once the expected_provider type alignment is fixed, this should give robust, low‑flake coverage for both listing and pagination behavior.

Also applies to: 131-167, 183-209

tests/core-providers/scenarios/transcription.go (1)

80-112: TTS retry integration looks consistent and safe

The SpeechSynthesis retry wiring (config, context, expectations, and WithSpeechTestRetry call) is coherent and maintains clear failure semantics (t.Fatalf on final error plus explicit empty-audio guard). No functional issues from this block on its own.

tests/core-providers/scenarios/tool_calls_streaming.go (1)

230-342: Chat tool-calls streaming test uses accumulator and retry correctly

The chat completions streaming test:

  • uses StreamingRetryConfig + WithStreamRetry appropriately,
  • accumulates tool calls chunk-by-chunk via the accumulator,
  • and asserts non-nil channel plus at-least-one-chunk before validating final tool calls.

The additional per-chunk logging (hasID/hasName/hasArgs plus lengths) should be very helpful when debugging provider-specific streaming quirks. No functional issues spotted here.

tests/core-providers/scenarios/validation_presets.go (1)

15-29: Latency expectation wiring is consistent across presets

Adding ShouldHaveLatency: true to the basic, embedding, speech, and transcription presets, and then threading it through CombineExpectations ensures latency presence is treated as a first-class expectation across scenarios.

Given that the validation logic in streaming tests now asserts latency on the last chunk, this alignment looks intentional and consistent. No issues here.

Also applies to: 103-115, 147-188, 387-447

tests/core-providers/scenarios/responses_stream.go (1)

307-327: Consistent use of StreamingRetryConfig + WithStreamRetry in auxiliary streaming tests

For the ResponsesStreamWithTools, ResponsesStreamWithReasoning, and ResponsesStreamLifecycle subtests, switching to StreamingRetryConfig with WithStreamRetry keeps retry behavior and logging consistent with the rest of the streaming suite, while preserving the existing per-chunk assertions.

The retry contexts also capture relevant metadata (provider/model/flags) for debugging. No issues here.

Also applies to: 440-459, 565-583

Comment thread tests/core-providers/scenarios/list_models.go
Comment thread tests/core-providers/scenarios/response_validation.go
Comment thread tests/core-providers/scenarios/speech_synthesis_stream.go
@Pratham-Mishra04 Pratham-Mishra04 force-pushed the 11-24-feat_core_tests_retry_logic_enhancements branch 2 times, most recently from 57aa424 to 78cb6be Compare November 24, 2025 09:49
Contributor

@coderabbitai coderabbitai Bot left a comment



Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/core-providers/scenarios/transcription_stream.go (1)

176-321: Stream transcription only accumulates Delta, ignoring Text, which can cause false failures

Inside the main streaming loop you only append transcribeData.Delta to fullTranscriptionText, while transcribeData.Text is only logged. If a provider streams via Text (or sets Text on final chunks) and never uses Delta, fullTranscriptionText will remain empty and the later checks:

  • if fullTranscriptionText == "" { ... }
  • word‑match round‑trip validation

will incorrectly fail even though valid transcription data was received.

Consider also appending non-empty Text when building fullTranscriptionText, e.g.:

-                        // Handle delta vs complete text chunks
-                        if transcribeData.Delta != nil {
-                            // This is a delta chunk
-                            deltaText := *transcribeData.Delta
-                            fullTranscriptionText += deltaText
-                            t.Logf("✅ Received transcription delta chunk %d with latency %d ms: '%s'", chunkIndex, response.BifrostTranscriptionStreamResponse.ExtraFields.Latency, deltaText)
-                        }
+                        // Handle delta vs complete text chunks
+                        if transcribeData.Delta != nil {
+                            // This is a delta chunk
+                            deltaText := *transcribeData.Delta
+                            fullTranscriptionText += deltaText
+                            t.Logf("✅ Received transcription delta chunk %d with latency %d ms: '%s'",
+                                chunkIndex,
+                                response.BifrostTranscriptionStreamResponse.ExtraFields.Latency,
+                                deltaText,
+                            )
+                        } else if transcribeData.Text != "" {
+                            // Fallback for providers that stream full text instead of deltas
+                            fullTranscriptionText += transcribeData.Text
+                        }

This keeps current delta behavior but also supports text-based streaming providers.

tests/core-providers/scenarios/embedding.go (1)

135-144: Fix invalid range and potential out-of-bounds in embedding validation loop

validateEmbeddingSemantics currently has:

responseDataLength := len(response.Data)
// ...
for i := range responseDataLength {
    vec, extractErr := getEmbeddingVector(response.Data[i])
    // ...
}

Two problems:

  1. range over an int (for i := range responseDataLength) requires Go 1.22 or later; on earlier toolchains this will not compile.
  2. responseDataLength may be reassigned to len(response.Data[0].Embedding.Embedding2DArray) while len(response.Data) is still 1; indexing response.Data[i] for i >= len(response.Data) will panic.

A minimal, safer fix is to iterate over the actual slice and ignore responseDataLength in the loop:

-	for i := range responseDataLength {
-		vec, extractErr := getEmbeddingVector(response.Data[i])
+	for i := 0; i < responseDataLength; i++ {
+		if i >= len(response.Data) {
+			t.Fatalf("Embedding response shape mismatch: response.Data has length %d but expected at least %d elements", len(response.Data), responseDataLength)
+		}
+		vec, extractErr := getEmbeddingVector(response.Data[i])

Depending on how 2D embeddings are represented in EmbeddingData, you may instead want to branch explicitly between “flat Data per input” and “single Data item with Embedding2DArray” and extract vectors from the correct structure; the current form can panic on 2D cases, and on Go versions before 1.22 the range-over-int loop will not compile at all.

🧹 Nitpick comments (19)
tests/core-providers/tests.go (1)

104-104: Consider removing duplicate status bracketing.

The status indicator appears twice on each line: once at the beginning and once after the scenario name. For example, line 104 produces output like [ENABLED] SUPPORTED: TextCompletion [ENABLED] Configured to run. The second occurrence seems redundant, especially since the summary lines (113-115) only use single bracketing.

Apply this diff to simplify the output:

-			t.Logf("[ENABLED]  SUPPORTED:   %-25s [ENABLED]  Configured to run", scenario.name)
+			t.Logf("[ENABLED]  SUPPORTED:   %-25s Configured to run", scenario.name)
-			t.Logf("[SKIPPED]  UNSUPPORTED: %-25s [SKIPPED]  Not supported by provider", scenario.name)
+			t.Logf("[SKIPPED]  UNSUPPORTED: %-25s Not supported by provider", scenario.name)

Also applies to: 107-107

core/providers/gemini/utils.go (1)

416-431: Nil-guard fixes for Gemini message content are correct; consider centralizing text extraction

The additional message.Content != nil checks around both standard content and tool-response handling remove a real nil-deref risk when Content is absent while preserving existing behavior for ContentStr and text ContentBlocks. This is a solid fix.

Optionally, the repeated “extract text from ChatMessageContent / ContentBlocks” logic in these two branches could be pulled into a small helper (e.g., extractTextFromChatContent(...)) to keep future changes in one place.

Also applies to: 477-493
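A possible shape for that shared helper. The struct fields below are minimal hypothetical stand-ins for schemas.ChatMessageContent (which has more fields), but the nil-safe extraction logic is the point:

```go
package main

import (
	"fmt"
	"strings"
)

// Minimal hypothetical shapes; the real ChatMessageContent differs.
type contentBlock struct{ Text *string }

type chatContent struct {
	ContentStr    *string
	ContentBlocks []contentBlock
}

// extractTextFromChatContent prefers the plain string form, otherwise
// joins text blocks; it is nil-safe throughout, which is exactly the
// deref risk the Gemini fix addresses.
func extractTextFromChatContent(c *chatContent) string {
	if c == nil {
		return ""
	}
	if c.ContentStr != nil {
		return *c.ContentStr
	}
	var parts []string
	for _, b := range c.ContentBlocks {
		if b.Text != nil {
			parts = append(parts, *b.Text)
		}
	}
	return strings.Join(parts, "")
}

func main() {
	s := "hello"
	fmt.Println(extractTextFromChatContent(&chatContent{ContentStr: &s})) // hello
	fmt.Println(extractTextFromChatContent(nil) == "")                    // true: nil-safe
}
```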

tests/core-providers/scenarios/speech_synthesis.go (1)

279-314: Per‑voice speech synthesis tests are now robustly retried

The AllVoiceOptions subtests now use WithSpeechTestRetry with a voice‑specific SpeechRetryConfig and TestRetryContext, plus minimal audio‑size thresholds. This should significantly reduce flaky voice‑specific failures while still failing when a voice delivers no or too‑small audio.

Consider extracting a small helper to build SpeechRetryConfig from a TestRetryConfig to avoid repeating the same field-copy boilerplate in multiple places.

tests/core-providers/scenarios/speech_synthesis_stream.go (1)

375-377: Streaming multi‑voice test: loop capture fix and validation‑driven retries look solid

The voiceCopy := voice pattern and exclusive use of voiceCopy inside the parallel subtest correctly fix the prior loop‑variable capture issue. The new WithSpeechStreamValidationRetry integration plus the per‑voice validator—which checks for stream errors, presence of audio data, and non‑zero LastLatency—provides a clear, retry‑aware pass/fail signal per voice. This is a clean and robust upgrade to the streaming speech tests.

If you see similar validators emerging elsewhere, it may be worth extracting a reusable “basic speech stream validator” helper that takes a voice label, but it’s fine as‑is for now.

Also applies to: 397-408, 412-469, 471-477

tests/core-providers/scenarios/transcription.go (1)

80-103: Retry integration for speech/transcription tests looks correct; consider deduplicating config mapping

The new use of GetTestRetryConfigForScenario + WithSpeechTestRetry / WithTranscriptionTestRetry is consistent with the shared framework and the expectations objects, and the error handling with GetErrorMessage + prefixing is coherent.

One small improvement: every call site manually maps TestRetryConfig → SpeechRetryConfig / TranscriptionRetryConfig with identical field copies and Conditions: []…{}. A thin helper (e.g., NewSpeechRetryConfigFromTestConfig(...), NewTranscriptionRetryConfigFromTestConfig(...)) would reduce repetition and make it harder to forget to propagate new fields or conditions in the future.

Also applies to: 142-171, 231-269, 313-335, 383-421, 453-492

tests/core-providers/scenarios/transcription_stream.go (1)

147-167: Streaming transcription retry configs use generic defaults for some scenarios

For the main round-trip stream test you call:

retryConfig := GetTestRetryConfigForScenario("TranscriptionStream", testConfig)

which maps to DefaultTranscriptionRetryConfig(). For JSON/lang/prompt variants you use scenario names like "TranscriptionStreamJSON", "TranscriptionStreamLang", "TranscriptionStreamPrompt" that currently fall through to DefaultTestRetryConfig().

Given StreamingRetryConfig() exists with stream-specific conditions (StreamErrorCondition, IncompleteStreamCondition) and comments explicitly avoiding EmptyResponseCondition for streams, it may be clearer and more consistent to:

  • Either add dedicated cases for these "TranscriptionStream*" scenario names in GetTestRetryConfigForScenario that return StreamingRetryConfig(), or
  • Bypass GetTestRetryConfigForScenario here and call StreamingRetryConfig() directly, as is done in some other streaming tests.

Not a blocker, but tightening this mapping would make streaming retry behavior more predictable across scenarios.

Also applies to: 357-373, 448-463, 530-547
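The first option could be as simple as a prefix match in the scenario switch. A toy sketch (the configs here are placeholders; the real ones carry delays and stream-safe conditions):

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-in retry configs; the real ones carry backoff settings and conditions.
type TestRetryConfig struct{ Name string }

func StreamingRetryConfig() TestRetryConfig   { return TestRetryConfig{Name: "streaming"} }
func DefaultTestRetryConfig() TestRetryConfig { return TestRetryConfig{Name: "default"} }

// getTestRetryConfigForScenario sketches the suggested mapping: every
// "TranscriptionStream*" scenario name gets the stream-safe config instead
// of falling through to the generic default.
func getTestRetryConfigForScenario(scenario string) TestRetryConfig {
	switch {
	case strings.HasPrefix(scenario, "TranscriptionStream"):
		return StreamingRetryConfig()
	default:
		return DefaultTestRetryConfig()
	}
}

func main() {
	fmt.Println(getTestRetryConfigForScenario("TranscriptionStreamJSON").Name)
	fmt.Println(getTestRetryConfigForScenario("Embedding").Name)
}
```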

tests/core-providers/scenarios/test_retry_framework.go (1)

257-273: Retry framework is coherent; consider tightening streaming configs and error handling assumptions

The expanded framework (validation‑first retry wrappers, timeout‑aware backoff, unified logging with ❌ prefixes) is internally consistent and a good fit for flaky LLM behavior across chat, responses, embeddings, transcription, and list models.

Two things to keep in mind as this grows:

  1. Streaming configs vs conditions

    • WithStreamRetry, WithSpeechStreamValidationRetry, and WithResponsesStreamValidationRetry assume that TestRetryConfig.Conditions are safe to evaluate in a streaming context (only have err and TestRetryContext, no single response object).
    • StreamingRetryConfig() follows that contract (only StreamErrorCondition / IncompleteStreamCondition) and explicitly avoids EmptyResponseCondition.
    • Ensure all stream callers either:
      • Use StreamingRetryConfig(), or
      • Provide a TestRetryConfig whose conditions are stream‑safe; otherwise, condition evaluators written for non‑stream responses may panic or behave unexpectedly when called with nil responses.
  2. Defensive error parsing for isTimeoutError

    • isTimeoutError now gates the “ALWAYS retry” path across many wrappers and calls GetErrorMessage(err) from utils.go. That helper assumes err.Error is non‑nil and fully populated.
    • If any code path produces a BifrostError with a nil Error field, the new, more frequent use of GetErrorMessage (including inside streaming retry logic) could start panicking tests that previously only logged err.Error.Message conditionally.
    • Consider hardening GetErrorMessage (and/or isTimeoutError) with additional err.Error != nil checks before accessing nested fields to make the retry framework robust against incomplete error objects.

These are refinements rather than blockers; the overall structure and cross‑type consistency of the retry logic look solid.

Also applies to: 581-689, 710-764, 820-841, 896-910, 940-957, 1117-1149, 1843-1966, 1980-2128, 2130-2289
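To make the second point concrete, here is the shape the hardening could take. The types are minimal stand-ins for schemas.BifrostError (which has more fields), so this only illustrates guarding each pointer hop:

```go
package main

import "fmt"

// Minimal stand-ins for the BifrostError shape; the real struct has more fields.
type ErrorField struct {
	Message string
}

type BifrostError struct {
	Error *ErrorField
}

// getErrorMessage sketches the suggested hardening: every pointer hop is
// guarded, so the retry framework's frequent calls never panic on a
// partially populated error object.
func getErrorMessage(err *BifrostError) string {
	if err == nil {
		return "unknown error (nil BifrostError)"
	}
	if err.Error == nil {
		return "unknown error (nil Error field)"
	}
	return err.Error.Message
}

func main() {
	fmt.Println(getErrorMessage(nil))
	fmt.Println(getErrorMessage(&BifrostError{}))
	fmt.Println(getErrorMessage(&BifrostError{Error: &ErrorField{Message: "request timed out"}}))
}
```

isTimeoutError can then safely string-match on whatever getErrorMessage returns, even for incomplete errors.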

tests/core-providers/scenarios/embedding.go (1)

75-103: Embedding retry wiring is consistent; consider preserving embedding-specific conditions

The use of GetTestRetryConfigForScenario("Embedding", ...) and WithEmbeddingTestRetry is consistent with the shared retry framework, and the expectations wiring looks good.

Right now, though, EmbeddingRetryConfig is constructed with an empty Conditions slice:

embeddingRetryConfig := EmbeddingRetryConfig{
	MaxAttempts: retryConfig.MaxAttempts,
	BaseDelay:   retryConfig.BaseDelay,
	MaxDelay:    retryConfig.MaxDelay,
	Conditions:  []EmbeddingRetryCondition{}, // ...
	OnRetry:     retryConfig.OnRetry,
	OnFinalFail: retryConfig.OnFinalFail,
}

If you intended to reuse any embedding‑specific retry conditions from DefaultEmbeddingRetryConfig (e.g., empty/invalid embeddings), consider either:

  • Exposing them as []EmbeddingRetryCondition and populating Conditions here, or
  • Removing the field from EmbeddingRetryConfig until you actually need per‑scenario conditions, to avoid implying that something is wired when it isn’t yet.

Not blocking, just a clarity/DRY tweak.
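If you go the first route, the exported slice could look roughly like this. The response and condition shapes are stand-ins (the real condition presumably operates on *BifrostEmbeddingResponse), so this is only a sketch of the wiring:

```go
package main

import "fmt"

// Stand-in for the embedding response; the real type is schemas.BifrostEmbeddingResponse.
type embeddingResponse struct {
	Data [][]float32
}

// Stand-in condition signature; the real one lives in test_retry_framework.go.
type EmbeddingRetryCondition func(resp *embeddingResponse) (retry bool, reason string)

// DefaultEmbeddingConditions is the kind of exported constructor the review
// suggests, so call sites can populate Conditions instead of passing an
// empty slice that implies wiring which isn't there.
func DefaultEmbeddingConditions() []EmbeddingRetryCondition {
	return []EmbeddingRetryCondition{
		func(resp *embeddingResponse) (bool, string) {
			if resp == nil || len(resp.Data) == 0 {
				return true, "empty embeddings"
			}
			return false, ""
		},
	}
}

func main() {
	conds := DefaultEmbeddingConditions()
	retry, reason := conds[0](&embeddingResponse{})
	fmt.Println(retry, reason)
}
```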

tests/core-providers/scenarios/chat_completion_stream.go (2)

43-68: Retry wrapper integration for basic chat streaming looks correct

  • StreamingRetryConfig + WithStreamRetry are wired correctly and preserve the original ChatCompletionStreamRequest behavior while adding retries.
  • Post‑retry checks (RequireNoError, responseChannel != nil) are sufficient before entering the read loop.
  • Timeout handling via streamCtx/select is good and prevents hangs.

You might optionally switch the remaining t.Fatal(...)s here (Lines 65, 90, 141, 145, 197, 201, 205) to t.Fatalf with a ❌ prefix for consistency with other tests, but behavior is already fine.

Also applies to: 73-88


242-333: Tool‑call streaming with retries is well‑structured; consider adding a timeout guard

  • Good use of StreamingRetryConfig + WithStreamRetry and detailed error reporting via GetErrorMessage and ❌‑prefixed messages.
  • Validation that response != nil and that at least one chunk and at least one tool call were seen is solid.
  • Logging of tool IDs, names, and arguments will be very helpful when debugging provider behavior.

One gap versus RunChatCompletionStreamTest: this loop has no per‑test timeout context and relies on the channel eventually closing or the global go test timeout. Mirroring the streamCtx timeout pattern (like Lines 73‑146 in this file) would prevent hangs if a provider never closes the stream.

tests/core-providers/scenarios/text_completion.go (1)

36-38: Text completion retry wiring and validation look sound; minor optional safety

  • GetTestRetryConfigForScenario("TextCompletion", ...) → TextCompletionRetryConfig → WithTextCompletionTestRetry is consistent with the shared retry framework.
  • Expectations are passed into the retry helper so validation happens per‑attempt, which is what you want.
  • GetTextCompletionContent(response) is only called after checking bifrostErr != nil, which matches WithTextCompletionTestRetry’s contract of returning a non‑nil response on success.

If you ever loosen that contract, adding a defensive if response == nil { t.Fatalf(...) } before calling GetTextCompletionContent would make the test more robust, but it’s not required with the current helper semantics.

Also applies to: 51-77

tests/core-providers/scenarios/list_models.go (1)

131-152: Pagination + next‑page retry flow is coherent; double‑check semantics for empty pages

  • Pagination test now uses the same retry wrapper for both the first page and the next page, with dedicated contexts and expectations.
  • For pagination expectations you explicitly allow "min_model_count": 0, which is sensible when page_size can exceed the total model count.
  • Note that WithListModelsTestRetry also retries on empty Data and treats that as a “❌ empty model list” reason; after retries, the test itself does not fail on len(response.Data) == 0, so a genuinely empty page can still be considered success (with warnings/logs).

If the intent is that an empty next page is fully acceptable (not just logged as a final‑failure reason), you might want to relax the “empty model list” retry condition when min_model_count is 0 for this scenario, to avoid noisy retries in the perfectly‑fine “no more pages” case.

Also applies to: 154-167, 183-224
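One way to express that relaxation, sketched with plain ints (the real decision would read min_model_count from the scenario's expectations map):

```go
package main

import "fmt"

// shouldRetryEmptyList sketches the suggested condition: when a scenario's
// expectations set min_model_count to 0 (the legitimate "no more pages"
// case), an empty page is a valid terminal state and should not retry.
func shouldRetryEmptyList(modelCount, minModelCount int) bool {
	if minModelCount == 0 {
		return false // empty page explicitly allowed for this scenario
	}
	return modelCount < minModelCount
}

func main() {
	fmt.Println(shouldRetryEmptyList(0, 0)) // empty last page, allowed
	fmt.Println(shouldRetryEmptyList(0, 1)) // first page must be non-empty
}
```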

tests/core-providers/scenarios/tool_calls_streaming.go (3)

17-41: Streaming tool‑call accumulator design is reasonable; watch for duplicate entries

  • Chat accumulator uses either tool‑call ID (preferred) or index as the key and correctly merges name/ID/arguments across chunks.
  • Responses accumulator’s keying logic (itemID → "item:"+id mapping, then migrating to callID when available) is flexible enough to handle different provider event shapes.
  • Because temporary keys like "default" or "name:..." are not always removed when migrating to callID, GetFinalResponsesToolCalls can return duplicates pointing to the same logical call.

The duplicates don’t break validation (they all still must have ID/Name/Args), but if you start asserting on counts, consider deduplicating by ID in GetFinalResponsesToolCalls to keep metrics and logs cleaner.

Also applies to: 43-92, 181-209
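The dedup could be a one-pass filter on the accumulated slice. toolCall here is a placeholder shape for whatever GetFinalResponsesToolCalls returns:

```go
package main

import "fmt"

// toolCall is a placeholder for the accumulator's final tool-call shape.
type toolCall struct {
	ID   string
	Name string
	Args string
}

// dedupeByID keeps the first occurrence of each non-empty ID, so temporary
// keys ("default", "name:...") that were never removed during callID
// migration can't produce two entries for one logical call.
func dedupeByID(calls []toolCall) []toolCall {
	seen := make(map[string]bool)
	out := make([]toolCall, 0, len(calls))
	for _, c := range calls {
		if c.ID != "" && seen[c.ID] {
			continue
		}
		seen[c.ID] = true
		out = append(out, c)
	}
	return out
}

func main() {
	calls := []toolCall{
		{ID: "call_1", Name: "weather"},
		{ID: "call_1", Name: "weather"}, // duplicate from a stale temporary key
		{ID: "call_2", Name: "search"},
	}
	fmt.Println(len(dedupeByID(calls)))
}
```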


218-342: Chat Completions streaming tool‑call test is robust; add timeout like other stream tests

  • StreamingRetryConfig + WithStreamRetry usage is consistent, and you validate:
    • at least one streaming response,
    • presence of at least one tool call,
    • completeness of each final tool call (ID, name, arguments) via validateStreamingToolCalls.
  • Per‑chunk logging (ID/name/args presence + lengths) is excellent for debugging partial or malformed provider events.

As with RunChatCompletionStreamTest, this for range responseChannel loop lacks a timeout and could hang if the stream never closes or stalls. Adopting a context.WithTimeout + select pattern here (and in the Responses test below) would improve resilience of the test suite.


350-663: Responses API streaming tool‑call accumulation and validation are thorough

  • The test now:
    • Uses WithStreamRetry with a dedicated context and expectations.
    • Handles all three relevant event types (FunctionCallArgumentsDelta, OutputItemAdded, FunctionCallArgumentsDone) and unifies arguments via the accumulator.
    • Logs rich debug info for each chunk (which fields are present, where they came from, argument lengths, IDs, names, item IDs).
  • validateStreamingToolCalls enforces that every final tool call has non‑empty ID/name/arguments and attempts to parse arguments as JSON while only warning on parse failures, which is an appropriate compromise for streaming partials.

Same as above, this loop relies on channel closure or global test timeout; adding an explicit per‑test timeout would prevent hangs when a provider misbehaves.

Also applies to: 666-714

tests/core-providers/scenarios/response_validation.go (1)

45-53: Extended ToolCallExpectation + validateSingleToolCall look coherent; be aware of ValidateArgsJSON semantics

  • New fields (ArgumentTypes, ArgumentValues, ValidateArgsJSON) give you fine‑grained control over argument validation.
  • validateSingleToolCall handles multiple shapes for arguments (map, map[interface{}]interface{}, string, []byte) and only attempts JSON parsing when ValidateArgsJSON is true, which is appropriate.

Note that even when ValidateArgsJSON is false, the later checks for RequiredArgs, ForbiddenArgs, ArgumentTypes, and ArgumentValues still operate on args, which will be nil unless you’ve manually provided a parsed map. In practice that’s fine as long as you only set those expectations when ValidateArgsJSON is true; if you plan to support non‑JSON argument formats, you may want to short‑circuit those checks when ValidateArgsJSON is false.

Also applies to: 1255-1345

tests/core-providers/scenarios/responses_stream.go (3)

224-241: expectations for ResponsesStream are built but never used

The expectations value created here is not fed into any validation or retry logic, which can be misleading and adds a bit of noise.

If these expectations are no longer part of the streaming validation surface, consider removing this block to keep the callback focused and avoid implying additional assertions that do not actually run.

-        // Enhanced validation expectations for responses streaming
-        expectations := GetExpectationsForScenario("ResponsesStream", testConfig, map[string]interface{}{})
-        expectations = ModifyExpectationsForProvider(expectations, testConfig.Provider)
-        expectations.ShouldContainKeywords = append(expectations.ShouldContainKeywords, []string{"paris"}...) // Should include story elements
-

774-813: Sequence-number check uses Fatalf but is labeled as a warning and bypasses retry

validateResponsesStreamingStructure calls t.Fatalf when sequence numbers regress, but the log string is prefixed with “⚠️ Warning”, and this hard-fails the test rather than surfacing as a validation failure through WithResponsesStreamValidationRetry.

If the intent is to keep this a hard failure, it would be clearer either to adjust the message to indicate an error or to move this check into the validation-result path so the retry wrapper can decide whether to retry or fail.


815-917: Streaming response validation is thorough; consider softening the latency requirement if needed

The validateResponsesStreamingResponse checks cover content presence, event types, sequence ordering, provider consistency, and latency on the last chunk. That’s a good level of scrutiny.

One potential tuning: the requirement that ExtraFields.Latency > 0 will cause every provider that doesn’t populate latency to fail validation (and thus be retried / eventually fail). If any providers legitimately omit or zero this field, you might want to downgrade this to a logged warning instead of a hard error:

-    if lastResponse != nil && lastResponse.BifrostResponsesStreamResponse != nil {
-        if lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency <= 0 {
-            errors = append(errors, fmt.Sprintf("Last streaming chunk missing latency information (got %d ms)", lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency))
-        } else {
-            t.Logf("✅ Total streaming latency: %d ms", lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency)
-        }
-    }
+    if lastResponse != nil && lastResponse.BifrostResponsesStreamResponse != nil {
+        if lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency <= 0 {
+            t.Logf("⚠️ Warning: Last streaming chunk missing latency information (got %d ms)", lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency)
+        } else {
+            t.Logf("✅ Total streaming latency: %d ms", lastResponse.BifrostResponsesStreamResponse.ExtraFields.Latency)
+        }
+    }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b3d485 and 78cb6be.

📒 Files selected for processing (34)
  • core/changelog.md (1 hunks)
  • core/providers/bedrock/bedrock.go (2 hunks)
  • core/providers/gemini/utils.go (2 hunks)
  • core/providers/openai/openai.go (1 hunks)
  • framework/changelog.md (1 hunks)
  • framework/streaming/accumulator.go (3 hunks)
  • framework/streaming/responses.go (0 hunks)
  • plugins/mocker/main.go (3 hunks)
  • tests/core-providers/scenarios/chat_completion_stream.go (5 hunks)
  • tests/core-providers/scenarios/complete_end_to_end.go (0 hunks)
  • tests/core-providers/scenarios/embedding.go (1 hunks)
  • tests/core-providers/scenarios/end_to_end_tool_calling.go (0 hunks)
  • tests/core-providers/scenarios/error_parser.go (2 hunks)
  • tests/core-providers/scenarios/image_base64.go (0 hunks)
  • tests/core-providers/scenarios/image_url.go (0 hunks)
  • tests/core-providers/scenarios/list_models.go (3 hunks)
  • tests/core-providers/scenarios/multi_turn_conversation.go (0 hunks)
  • tests/core-providers/scenarios/multiple_images.go (0 hunks)
  • tests/core-providers/scenarios/reasoning.go (0 hunks)
  • tests/core-providers/scenarios/response_validation.go (10 hunks)
  • tests/core-providers/scenarios/responses_stream.go (9 hunks)
  • tests/core-providers/scenarios/speech_synthesis.go (3 hunks)
  • tests/core-providers/scenarios/speech_synthesis_stream.go (2 hunks)
  • tests/core-providers/scenarios/test_retry_framework.go (28 hunks)
  • tests/core-providers/scenarios/text_completion.go (1 hunks)
  • tests/core-providers/scenarios/text_completion_stream.go (3 hunks)
  • tests/core-providers/scenarios/tool_calls_streaming.go (4 hunks)
  • tests/core-providers/scenarios/transcription.go (7 hunks)
  • tests/core-providers/scenarios/transcription_stream.go (1 hunks)
  • tests/core-providers/scenarios/utils.go (1 hunks)
  • tests/core-providers/scenarios/validation_presets.go (2 hunks)
  • tests/core-providers/tests.go (1 hunks)
  • transports/bifrost-http/lib/config.go (13 hunks)
  • transports/changelog.md (1 hunks)
💤 Files with no reviewable changes (8)
  • framework/streaming/responses.go
  • tests/core-providers/scenarios/reasoning.go
  • tests/core-providers/scenarios/image_base64.go
  • tests/core-providers/scenarios/image_url.go
  • tests/core-providers/scenarios/end_to_end_tool_calling.go
  • tests/core-providers/scenarios/complete_end_to_end.go
  • tests/core-providers/scenarios/multi_turn_conversation.go
  • tests/core-providers/scenarios/multiple_images.go
✅ Files skipped from review due to trivial changes (1)
  • transports/changelog.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/core-providers/scenarios/validation_presets.go
  • core/providers/openai/openai.go
  • tests/core-providers/scenarios/error_parser.go
  • framework/streaming/accumulator.go
  • plugins/mocker/main.go
🧰 Additional context used
🧬 Code graph analysis (11)
tests/core-providers/scenarios/utils.go (5)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • DefaultSpeechRetryConfig (860-873)
  • TestRetryContext (170-175)
  • SpeechRetryConfig (218-225)
  • SpeechRetryCondition (146-149)
  • WithSpeechTestRetry (1328-1478)
tests/core-providers/scenarios/validation_presets.go (1)
  • SpeechExpectations (148-164)
core/schemas/speech.go (1)
  • BifrostSpeechResponse (22-29)
core/schemas/bifrost.go (2)
  • BifrostError (353-362)
  • SpeechRequest (94-94)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/text_completion_stream.go (1)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
  • WithStreamRetry (582-690)
tests/core-providers/scenarios/text_completion.go (5)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • TextCompletionRetryConfig (198-205)
  • TextCompletionRetryCondition (134-137)
  • WithTextCompletionTestRetry (1175-1325)
plugins/mocker/main.go (1)
  • Conditions (89-94)
core/schemas/textcompletions.go (1)
  • BifrostTextCompletionResponse (69-77)
core/schemas/bifrost.go (2)
  • BifrostError (353-362)
  • TextCompletionRequest (87-87)
tests/core-providers/scenarios/utils.go (2)
  • GetErrorMessage (588-612)
  • GetTextCompletionContent (341-354)
tests/core-providers/scenarios/embedding.go (4)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • GetTestRetryConfigForScenario (1118-1150)
  • TestRetryContext (170-175)
  • EmbeddingRetryConfig (238-245)
  • EmbeddingRetryCondition (158-161)
  • WithEmbeddingTestRetry (1503-1653)
core/schemas/embedding.go (1)
  • BifrostEmbeddingResponse (22-28)
core/schemas/bifrost.go (1)
  • BifrostError (353-362)
tests/core-providers/scenarios/utils.go (1)
  • GetErrorMessage (588-612)
tests/core-providers/scenarios/transcription_stream.go (7)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • GetTestRetryConfigForScenario (1118-1150)
  • TestRetryContext (170-175)
  • SpeechRetryConfig (218-225)
  • SpeechRetryCondition (146-149)
  • WithSpeechTestRetry (1328-1478)
core/schemas/provider.go (1)
  • Provider (207-234)
tests/core-providers/scenarios/validation_presets.go (1)
  • SpeechExpectations (148-164)
core/schemas/speech.go (1)
  • BifrostSpeechResponse (22-29)
core/schemas/bifrost.go (2)
  • BifrostError (353-362)
  • SpeechRequest (94-94)
tests/core-providers/scenarios/utils.go (1)
  • GetErrorMessage (588-612)
core/utils.go (1)
  • GetErrorMessage (239-276)
tests/core-providers/scenarios/chat_completion_stream.go (3)
tests/core-providers/scenarios/test_retry_framework.go (3)
  • StreamingRetryConfig (821-841)
  • TestRetryContext (170-175)
  • WithStreamRetry (582-690)
core/schemas/bifrost.go (3)
  • BifrostStream (318-325)
  • BifrostError (353-362)
  • ChatCompletionStreamRequest (90-90)
tests/core-providers/scenarios/utils.go (1)
  • GetErrorMessage (588-612)
tests/core-providers/scenarios/response_validation.go (3)
core/schemas/models.go (1)
  • BifrostListModelsResponse (36-45)
core/schemas/bifrost.go (3)
  • BifrostError (353-362)
  • RequestType (83-83)
  • ListModelsRequest (86-86)
tests/core-providers/scenarios/error_parser.go (3)
  • ParseBifrostError (47-104)
  • FormatErrorConcise (353-356)
  • LogError (359-366)
tests/core-providers/scenarios/list_models.go (7)
tests/core-providers/scenarios/test_retry_framework.go (5)
  • GetTestRetryConfigForScenario (1118-1150)
  • TestRetryContext (170-175)
  • ListModelsRetryConfig (248-255)
  • ListModelsRetryCondition (164-167)
  • WithListModelsTestRetry (1846-1966)
core/schemas/provider.go (1)
  • Provider (207-234)
tests/core-providers/scenarios/response_validation.go (1)
  • ResponseExpectations (18-43)
plugins/mocker/main.go (1)
  • Conditions (89-94)
core/schemas/models.go (2)
  • BifrostListModelsResponse (36-45)
  • BifrostListModelsRequest (23-34)
core/schemas/bifrost.go (2)
  • BifrostError (353-362)
  • ListModelsRequest (86-86)
tests/core-providers/scenarios/utils.go (1)
  • GetErrorMessage (588-612)
core/providers/gemini/utils.go (1)
core/providers/gemini/types.go (2)
  • Content (871-879)
  • Part (885-909)
tests/core-providers/scenarios/speech_synthesis_stream.go (3)
tests/core-providers/scenarios/test_retry_framework.go (2)
  • WithSpeechStreamValidationRetry (1990-2128)
  • SpeechStreamValidationResult (1980-1986)
core/schemas/bifrost.go (2)
  • BifrostStream (318-325)
  • BifrostError (353-362)
core/schemas/speech.go (1)
  • BifrostSpeechStreamResponse (133-138)
transports/bifrost-http/lib/config.go (2)
core/schemas/mcp.go (1)
  • MCPConfig (11-13)
framework/configstore/clientconfig.go (1)
  • GovernanceConfig (67-74)
🔇 Additional comments (17)
tests/core-providers/tests.go (1)

113-115: LGTM! Clean structured logging format.

The bracketed status indicators for the configuration summary are clear and consistent, providing both machine-parseable structure and human readability.

framework/changelog.md (1)

1-2: Changelog entries look consistent and scoped correctly

The new/updated entries accurately describe the corresponding framework changes and follow the existing bullet style.

core/changelog.md (1)

1-4: Core changelog entries correctly summarize the fixes

The added lines clearly document Bedrock/Gemini/OpenAI streaming fixes and provider test enhancements; no further adjustments needed.

core/providers/bedrock/bedrock.go (1)

909-927: Final Bedrock ResponsesStream latency now correctly reflects total stream duration

Using time.Since(startTime) only for the last finalResponse while keeping per-response Latency based on lastChunkTime aligns the “last chunk” metrics with overall stream time without disturbing earlier chunk timing. This looks correct and backward-compatible with existing chunk-level semantics.

tests/core-providers/scenarios/speech_synthesis.go (2)

85-113: Base speech synthesis test’s retry integration looks correct

Wrapping the core SpeechRequest in WithSpeechTestRetry with a SpeechRetryConfig derived from GetTestRetryConfigForScenario("SpeechSynthesis", …) plus SpeechExpectations is consistent with the retry framework and keeps scenario metadata (TestRetryContext) rich for debugging. The error handling and subsequent validations are well-ordered.

Also applies to: 116-123


210-243: HD long-text speech synthesis scenario uses the retry framework safely

The HD variant correctly reuses the retry config pattern, validates speechResponse and Audio before size checks, and logs model mismatches as warnings instead of hard failures. The explicit audioSize threshold and logging make this scenario robust and debuggable.

tests/core-providers/scenarios/text_completion_stream.go (1)

3-5: Text-completion streaming retry wiring looks solid

The new use of StreamingRetryConfig + TestRetryContext with WithStreamRetry for the varied-prompts and parameterized streaming tests is consistent with the base streaming test and the shared retry framework. Scenario names and metadata are descriptive, and existing read/validation loops remain unchanged.

No issues from my side here.

Also applies to: 253-268, 395-412

tests/core-providers/scenarios/chat_completion_stream.go (1)

187-201: Adding “paris” expectation is safe given content checks are now warnings

expectations.ShouldContainAnyOf = append(..., "paris") relies on ShouldContainAnyOf now only generating warnings (not hard failures) in validateChatContent. This preserves test stability while still surfacing when providers omit the expected city mention.

tests/core-providers/scenarios/list_models.go (1)

31-52: ListModels retry + expectations are wired correctly and fix prior expected_provider typing

  • Retry config/context and WithListModelsTestRetry usage are consistent with the shared retry framework.
  • ResponseExpectations now sets "expected_provider": string(testConfig.Provider), which matches the string assertion in validateListModelsFields and addresses the earlier bug where the provider check was effectively disabled.
  • Post‑retry guards on bifrostErr, response, and len(response.Data) == 0 clearly surface hard failures after retries.

Looks good as a robust, retry‑aware ListModels test.

Also applies to: 53-81

tests/core-providers/scenarios/response_validation.go (3)

384-449: Converting strict content keyword/pattern checks to warnings aligns with non‑deterministic LLM behavior

  • For chat, text completion, and Responses content validators, missing ShouldContainKeywords/ShouldContainAnyOf, presence of ShouldNotContainWords, and pattern mismatches now add warnings instead of failing the test.
  • Empty content and structural failures are still treated as hard errors.

This strikes a good balance: tests still fail on clearly bad responses but won’t flake when providers phrase things differently than expected.

Also applies to: 570-636, 715-781


276-311: List‑models validation function and helpers are consistent with retry usage; provider check now works

  • ValidateListModelsResponse follows the same pattern as other validators: early‑out on error/nil response, then field validation + metrics + logging.
  • validateListModelsFields enforces:
    • non‑empty Data and non‑empty model IDs (for the base ListModels test),
    • provider match using expected_provider (string) against response.ExtraFields.Provider,
    • correct RequestType == schemas.ListModelsRequest,
    • non‑negative latency when ShouldHaveLatency,
    • optional min_model_count constraint.
  • collectListModelsResponseMetrics records useful flags/metrics (model_count, next_page_token, provider, request_type, latency presence), which integrate nicely with the broader metrics framework.

The earlier type mismatch for expected_provider (string vs ModelProvider) is now resolved at the call sites by passing string(testConfig.Provider), so this check will actually execute.

Also applies to: 1030-1093, 1095-1102


1347-1370: Error logging now guarantees ❌‑prefixed lines, which improves retry diagnostics

  • logValidationResults still logs failures via t.Logf (not t.Errorf), which is important for allowing retries.
  • Each error line is normalized to include a leading ❌, ensuring consistency even when individual validators omit the prefix.

This should make it much easier to scan CI logs for failures and integrates nicely with the retry framework’s use of ❌‑prefixed reasons.

tests/core-providers/scenarios/responses_stream.go (5)

48-213: ResponsesStream validation wrapper integration looks correct

The use of WithResponsesStreamValidationRetry with a per-chunk reader, safety cap (>500 chunks), and timeout-based failure path is coherent and should work well with the new retry framework; nothing blocking here from a correctness perspective.


244-252: Final validation aggregation and error formatting look good

Combining validationResult.Errors and validationResult.StreamErrors and then normalizing the message with a single prefix gives clear failure output while preserving individual error messages; no issues here.


307-327: Adopting WithStreamRetry for ResponsesStreamWithTools is appropriate

Routing the tools streaming scenario through WithStreamRetry with a populated TestRetryContext aligns it with the broader retry framework and should help reduce flakiness without changing the assertion semantics below.


440-460: Reasoning stream now consistently goes through the retry wrapper

Using WithStreamRetry for the reasoning stream test (with reasoning-specific flags in ExpectedBehavior/TestMetadata) is a clean integration and keeps this scenario consistent with other streaming tests.


565-583: Lifecycle stream test’s migration to WithStreamRetry looks solid

The lifecycle scenario now benefits from the common streaming retry behavior while preserving the existing, detailed lifecycle assertions below; this is a straightforward and correct application of the retry helper.

Comment thread: tests/core-providers/scenarios/utils.go
Comment thread: transports/bifrost-http/lib/config.go
@akshaydeo force-pushed the 11-24-feat_core_tests_retry_logic_enhancements branch from 78cb6be to dd8cd0c on November 24, 2025 10:15
@akshaydeo akshaydeo mentioned this pull request Nov 24, 2025
18 tasks
Contributor

akshaydeo commented Nov 24, 2025

Merge activity

  • Nov 24, 10:29 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Nov 24, 10:29 AM UTC: @akshaydeo merged this pull request with Graphite.

@akshaydeo akshaydeo merged commit 91b651f into main Nov 24, 2025
7 of 8 checks passed
@akshaydeo akshaydeo deleted the 11-24-feat_core_tests_retry_logic_enhancements branch November 24, 2025 10:29


Development

Successfully merging this pull request may close these issues.

Files API Support

2 participants