fix: improve Langfuse test isolation to prevent flaky failures by jquinter · Pull Request #21073 · BerriAI/litellm

jquinter · 2026-02-12T22:54:43Z

Problem

The test_log_langfuse_v2_handles_null_usage_values test was intermittently failing in CI with:

AssertionError: Expected 'generation' to have been called once. Called 0 times.

The test passed consistently locally but failed randomly in CI, blocking builds for everyone.

Root Cause

The test was creating fresh mocks to avoid state pollution:

mock_trace = MagicMock()
mock_client = MagicMock()
self.logger.Langfuse = mock_client

However, this approach did not fully isolate the test from the mock configuration built in setUp, leading to inconsistent behavior in CI environments where test ordering or timing might differ.

Fix

Instead of creating entirely new mocks, properly reset the existing setUp mocks using .reset_mock():

self.mock_langfuse_client.reset_mock()
self.mock_langfuse_trace.reset_mock()
self.mock_langfuse_generation.reset_mock()

This ensures:

- Clean mock state for each test
- Proper mock chain configuration is maintained
- Better test isolation without losing the benefits of setUp
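A minimal sketch of why `reset_mock()` fits here, assuming the setUp mocks are chained through `return_value` (the wiring below is a guess at the shape of the test fixture, not a copy of it). By default `reset_mock()` clears recorded calls but leaves `return_value` and `side_effect` configuration in place, so the chain stays connected:

```python
from unittest.mock import MagicMock

# Hypothetical stand-ins for the mocks that setUp builds; the names mirror
# the PR description, but the wiring is an assumption for illustration.
mock_langfuse_client = MagicMock()
mock_langfuse_trace = MagicMock()
mock_langfuse_generation = MagicMock()
mock_langfuse_client.trace.return_value = mock_langfuse_trace
mock_langfuse_trace.generation.return_value = mock_langfuse_generation

# Simulate a call recorded before the test body runs.
mock_langfuse_client.trace(name="earlier").generation(name="earlier")
assert mock_langfuse_trace.generation.call_count == 1

# reset_mock() wipes the call history; return_value configuration is kept
# unless return_value=True is passed explicitly.
mock_langfuse_client.reset_mock()
mock_langfuse_trace.reset_mock()
mock_langfuse_generation.reset_mock()

calls_after_reset = mock_langfuse_trace.generation.call_count
chain_intact = mock_langfuse_client.trace() is mock_langfuse_trace

# The code under test can now be checked against a clean call history.
mock_langfuse_client.trace(name="t").generation(name="g")
mock_langfuse_trace.generation.assert_called_once()
```

Replacing the mocks with fresh `MagicMock()` objects would instead require rebuilding every `return_value` link by hand, which is where a test can silently end up asserting on an object the code under test never touches.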

Testing

poetry run pytest tests/test_litellm/integrations/test_langfuse.py::TestLangfuseUsageDetails::test_log_langfuse_v2_handles_null_usage_values -v

✅ Test passes consistently

This fix improves test reliability and should prevent the flaky CI failures.

🤖 Generated with Claude Code


…ls in both streaming and non-streaming

When using both `tools` and `response_format` with Bedrock Converse API, LiteLLM internally adds a fake tool called `json_tool_call` to handle structured output. Bedrock may return both this internal tool AND real user-defined tools, causing consumers like OpenAI Agents SDK to break trying to dispatch `json_tool_call`.

This fix:
- Extracts `_filter_json_mode_tools()` to handle 3 scenarios: only json_tool_call (convert to content), mixed with real tools (filter it out), or no json_tool_call
- Fixes streaming by adding json_mode awareness to AWSEventStreamDecoder, converting json_tool_call chunks to text content while passing real tool chunks through
- Changes `optional_params.pop("json_mode")` to `.get()` to avoid mutating caller dict

Fixes BerriAI#18381

Credits @haggai-backline for the original investigation in PR BerriAI#18384

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…_tool_call content in mixed case

- Move `import json` to top of converse_transformation.py per CLAUDE.md style guide
- In the mixed tools case, preserve json_tool_call arguments as message content so the structured output from response_format is not silently lost
- Update test to verify json_tool_call content is preserved as message text

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
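The three scenarios described in these commit messages can be sketched roughly as follows. This is not the LiteLLM implementation: the function name `filter_json_mode_tools`, the return shape, and the OpenAI-style tool-call dicts are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

JSON_TOOL_NAME = "json_tool_call"  # internal tool name taken from the commit message


def filter_json_mode_tools(tool_calls: List[dict]) -> Tuple[Optional[str], List[dict]]:
    """Hypothetical helper: split the internal JSON tool from real tool calls.

    Returns (content, remaining_tool_calls). Assumes each tool call looks like
    {"function": {"name": str, "arguments": str}}.
    """
    json_calls = [t for t in tool_calls if t["function"]["name"] == JSON_TOOL_NAME]
    real_calls = [t for t in tool_calls if t["function"]["name"] != JSON_TOOL_NAME]

    # Scenario 3: no internal tool present, so nothing changes.
    if not json_calls:
        return None, tool_calls

    # Scenarios 1 and 2: surface the structured output as message content.
    # With only json_tool_call present, real_calls is empty (scenario 1);
    # in the mixed case the real tools pass through untouched (scenario 2).
    content = json_calls[0]["function"]["arguments"]
    return content, real_calls


json_call = {"function": {"name": "json_tool_call", "arguments": '{"city": "Paris"}'}}
weather_call = {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}

only_json = filter_json_mode_tools([json_call])
mixed = filter_json_mode_tools([json_call, weather_call])
no_json = filter_json_mode_tools([weather_call])
```

Keeping the arguments as content in the mixed case follows the second commit message, which notes that dropping them would silently lose the `response_format` output.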

httpx.Response.json() is synchronous, not async. Using AsyncMock made the test fail because it turned json() into a coroutine.
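A small standard-library illustration of the difference, using bare mocks rather than a real `httpx.Response` (the payload is a made-up value):

```python
import inspect
from unittest.mock import AsyncMock, MagicMock

payload = {"id": "msg_123"}  # hypothetical response body

# MagicMock: json() behaves like an ordinary synchronous method.
sync_response = MagicMock()
sync_response.json.return_value = payload
sync_result = sync_response.json()

# AsyncMock: ordinary child attributes are AsyncMock as well, so calling
# json() hands back a coroutine that would have to be awaited.
async_response = AsyncMock()
async_response.json.return_value = payload
async_result = async_response.json()
is_coroutine = inspect.iscoroutine(async_result)
async_result.close()  # avoid a "coroutine was never awaited" warning
```

Code that calls `response.json()` without `await` therefore receives a coroutine object instead of the dict when the response is an `AsyncMock`.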

The test was expecting resource/id to be the CZRN value, but the implementation (line 144 in transform.py) explicitly sets it to the model name. The CZRN is used only to extract components for tags. This was causing the test to fail with: AssertionError: assert 'gpt-4' == 'test-czrn'

The test was creating fresh mocks but not fully isolating from setUp state, causing intermittent CI failures with 'Expected generation to be called once. Called 0 times.' Instead of creating fresh mocks, properly reset the existing setUp mocks to ensure clean state while maintaining proper mock chain configuration.

vercel · 2026-02-12T22:54:48Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
litellm	Ready	Preview, Comment	Feb 12, 2026 10:56pm

greptile-apps · 2026-02-12T22:57:45Z

Greptile Overview

Greptile Summary

This PR bundles several fixes: (1) Bedrock Converse now properly handles mixed json_tool_call + real tool responses by filtering out the internal json_tool_call in both non-streaming (_filter_json_mode_tools) and streaming (AWSEventStreamDecoder) paths, fixing #18381. (2) Changed optional_params.pop("json_mode") to .get("json_mode") to prevent mutating the caller's dict. (3) Fixed flaky Langfuse test by using reset_mock() instead of creating fresh mocks. (4) Various test fixes for MCP, CloudZero, and Anthropic pass-through mocks.

- Extracted _filter_json_mode_tools static method in AmazonConverseConfig to handle 3 scenarios: only json_tool_call, mixed with real tools, and no json_tool_call
- Added json_mode parameter to AWSEventStreamDecoder for streaming-level json_tool_call suppression, converting tool input to text content
- Fixed optional_params mutation by switching from .pop() to .get() in _transform_response
- Fixed flaky Langfuse test by using reset_mock() on existing setUp mocks instead of creating new ones
- Updated MCP test fake_process signatures to accept **kwargs
- Corrected CloudZero test expectation for resource/id to match actual implementation
- Fixed Anthropic pass-through test to use MagicMock instead of AsyncMock for synchronous httpx.Response.json()
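Of the items above, the `.pop()` to `.get()` change is easy to show in isolation. The two functions below are hypothetical stand-ins, not the actual `_transform_response` code; they only demonstrate the side effect on the caller's dict:

```python
def read_json_mode_with_pop(optional_params: dict) -> bool:
    # pop() removes the key from the dict the caller passed in.
    return bool(optional_params.pop("json_mode", False))


def read_json_mode_with_get(optional_params: dict) -> bool:
    # get() reads the value and leaves the caller's dict untouched.
    return bool(optional_params.get("json_mode", False))


params = {"json_mode": True, "temperature": 0.2}
read_json_mode_with_pop(params)
key_survives_pop = "json_mode" in params

params = {"json_mode": True, "temperature": 0.2}
read_json_mode_with_get(params)
key_survives_get = "json_mode" in params
```

If the same `optional_params` dict is reused for a later step or a retry, the `pop()` variant would see `json_mode` as unset the second time around.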

Confidence Score: 4/5

- This PR is safe to merge — the Bedrock json_mode refactor is well-structured with good test coverage, and the test fixes are straightforward corrections.
- The Bedrock changes are a meaningful refactor with proper edge case handling (3 scenarios in _filter_json_mode_tools, streaming suppression logic). Tests cover the new functionality well including backward compatibility. The Langfuse and other test fixes are low-risk. Minor deduction because the Bedrock streaming changes introduce state tracking (_current_tool_name) that should be exercised in more complex multi-tool streaming scenarios.
- Pay close attention to litellm/llms/bedrock/chat/invoke_handler.py and litellm/llms/bedrock/chat/converse_transformation.py — these contain the core Bedrock json_mode filtering logic.

Important Files Changed

Filename	Overview
litellm/llms/bedrock/chat/converse_transformation.py	Refactored json_tool_call filtering into `_filter_json_mode_tools` static method handling 3 scenarios (only json, mixed, none). Changed `.pop("json_mode")` to `.get("json_mode")` to prevent mutating `optional_params`. Well-structured with good edge case handling.
litellm/llms/bedrock/chat/invoke_handler.py	Added `json_mode` parameter to `AWSEventStreamDecoder` with streaming-level filtering of `json_tool_call`. Uses `_current_tool_name` state to track and suppress json_tool_call blocks, converting their content to text instead. Correctly handles start/delta/stop events.
litellm/llms/bedrock/chat/converse_handler.py	Simple one-line change passing `json_mode` to `AWSEventStreamDecoder` constructor, enabling streaming json_tool_call filtering.
tests/test_litellm/integrations/test_langfuse.py	Replaced fresh mock creation with `reset_mock()` on existing `setUp` mocks. Preserves mock chain configuration while ensuring clean state, fixing flaky CI failures.
tests/test_litellm/llms/bedrock/chat/test_converse_transformation.py	Added comprehensive tests for `_filter_json_mode_tools` (mixed tool scenario, optional_params mutation prevention) and streaming decoder (json_tool_call filtering, backward compatibility). All mock-based, no network calls.
tests/mcp_tests/test_mcp_chat_completions.py	Updated `fake_process` signatures to accept `**kwargs` for compatibility with updated caller signatures. Minimal change.
tests/test_litellm/integrations/cloudzero/test_transform.py	Corrected test expectation for `resource/id` field from `'test-czrn'` to `'gpt-4'` to match actual implementation behavior at line 144 of transform.py.
tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_anthropic_experimental_pass_through_messages_handler.py	Changed `AsyncMock` to `MagicMock` for `httpx.Response` mock since `json()` is synchronous. Added clarifying comments.
poetry.lock	Auto-regenerated lockfile. Minor version bump for `litellm-proxy-extras` (0.4.33 → 0.4.35). Poetry version difference (2.2.0 → 2.1.4).

Last reviewed commit: 6a56399

greptile-apps

9 files reviewed, no comments

krrishdholakia · 2026-02-13T04:22:06Z

litellm/llms/bedrock/chat/converse_handler.py

why does this test modify bedrock?

jquinter · 2026-02-13T08:30:18Z

Closing this PR - it was accidentally created from the bedrock branch instead of main, which included unrelated changes.

Created a clean replacement: #21093

The new PR contains only the Langfuse test isolation fix without any bedrock or other unrelated changes.

Thanks @krrishdholakia for catching this!

jquinter and others added 7 commits February 12, 2026 19:12

chore: regenerate poetry.lock after rebase

2dd4078

fix: update MCP test mocks to accept litellm_trace_id parameter

292ca4d

fix: use MagicMock instead of AsyncMock for httpx.Response mock

49914db


vercel bot deployed to Preview February 12, 2026 22:56

greptile-apps bot reviewed Feb 12, 2026

krrishdholakia reviewed Feb 13, 2026

jquinter mentioned this pull request Feb 13, 2026

fix: improve Langfuse test isolation to prevent flaky failures #21093

Merged

jquinter closed this Feb 13, 2026

jquinter wants to merge 7 commits into BerriAI:main from jquinter:fix/langfuse-test-flaky-mock
