fix: improve Langfuse test isolation to prevent flaky failures#21073
jquinter wants to merge 7 commits into BerriAI:main from
Conversation
…ls in both streaming and non-streaming
When using both `tools` and `response_format` with Bedrock Converse API, LiteLLM
internally adds a fake tool called `json_tool_call` to handle structured output.
Bedrock may return both this internal tool AND real user-defined tools, causing
consumers like the OpenAI Agents SDK to break when trying to dispatch `json_tool_call`.
This fix:
- Extracts `_filter_json_mode_tools()` to handle 3 scenarios: only json_tool_call
(convert to content), json_tool_call mixed with real tools (filter it out), or no
json_tool_call (pass tools through unchanged)
- Fixes streaming by adding json_mode awareness to AWSEventStreamDecoder, converting
json_tool_call chunks to text content while passing real tool chunks through
- Changes `optional_params.pop("json_mode")` to `.get()` to avoid mutating caller dict
Fixes BerriAI#18381
Credits @haggai-backline for the original investigation in PR BerriAI#18384
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
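The three scenarios from the commit message can be sketched in plain Python. This is a hedged illustration, not LiteLLM's actual implementation: the function name, dict shapes, and `JSON_TOOL_NAME` constant are assumptions that mirror the behavior described above.

```python
# Hedged sketch (not LiteLLM's actual code) of the three scenarios described
# in the commit: handling Bedrock's internal `json_tool_call` alongside real tools.
import json
from typing import List, Optional, Tuple

JSON_TOOL_NAME = "json_tool_call"  # internal tool LiteLLM adds for response_format


def filter_json_mode_tools(
    tool_calls: List[dict],
) -> Tuple[List[dict], Optional[str]]:
    """Return (real tool calls, text content recovered from json_tool_call)."""
    json_calls = [t for t in tool_calls if t["name"] == JSON_TOOL_NAME]
    real_calls = [t for t in tool_calls if t["name"] != JSON_TOOL_NAME]
    content = None
    if json_calls:
        # Scenario 1: only json_tool_call -> its arguments become message content.
        # Scenario 2: mixed with real tools -> drop json_tool_call from the tool
        # list, but keep its arguments as content so structured output isn't lost.
        content = json.dumps(json_calls[0]["arguments"])
    # Scenario 3: no json_tool_call -> tool calls pass through unchanged.
    return real_calls, content


calls = [
    {"name": "json_tool_call", "arguments": {"answer": 42}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
]
real, content = filter_json_mode_tools(calls)
print(len(real), content)  # prints: 1 {"answer": 42}
```

In the mixed case, a consumer dispatching tool calls now only ever sees `get_weather`, while the structured-output payload survives as message content.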
…_tool_call content in mixed case

- Move `import json` to top of converse_transformation.py per CLAUDE.md style guide
- In the mixed tools case, preserve json_tool_call arguments as message content so the structured output from response_format is not silently lost
- Update test to verify json_tool_call content is preserved as message text

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
httpx.Response.json() is synchronous, not async. Using AsyncMock made the test fail because it turned json() into a coroutine.
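A minimal reproduction of that pitfall, using only `unittest.mock` (no httpx needed, since the point is about the mock itself):

```python
# httpx.Response.json() is synchronous. Mocking the response with AsyncMock
# turns every attribute, including json(), into an async method, so the test
# reads a coroutine object instead of the payload. MagicMock behaves correctly.
import inspect
from unittest.mock import AsyncMock, MagicMock

bad = AsyncMock()
bad.json.return_value = {"ok": True}
coro = bad.json()
assert inspect.iscoroutine(coro)  # surprise: a coroutine, not a dict
coro.close()  # avoid a "coroutine was never awaited" warning

good = MagicMock()
good.json.return_value = {"ok": True}
assert good.json() == {"ok": True}  # synchronous, matching httpx.Response.json()
```

This is why the commit swaps `AsyncMock` for `MagicMock` on the response object while the client itself (whose request methods are genuinely async) can stay an `AsyncMock`.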
The test was expecting resource/id to be the CZRN value, but the implementation (line 144 in transform.py) explicitly sets it to the model name. The CZRN is used only to extract components for tags. This was causing the test to fail with: AssertionError: assert 'gpt-4' == 'test-czrn'
The test was creating fresh mocks but not fully isolating from setUp state, causing intermittent CI failures with 'Expected generation to be called once. Called 0 times.' Instead of creating fresh mocks, properly reset the existing setUp mocks to ensure clean state while maintaining proper mock chain configuration.
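The reset-instead-of-recreate pattern the commit describes can be sketched as follows. All names here are hypothetical stand-ins for the real Langfuse test fixtures; the key point is that `reset_mock()` clears call history but, by default, leaves configured `return_value` chains intact.

```python
# Hedged sketch (names hypothetical) of the isolation pattern described above:
# reset the mocks created in setUp rather than replacing them, so the
# configured mock chain survives while each test starts with clean call state.
import unittest
from unittest.mock import MagicMock


class TestLangfuseLogging(unittest.TestCase):
    def setUp(self):
        self.mock_client = MagicMock()
        # Configured chain: client.trace(...).generation(...)
        self.mock_trace = self.mock_client.trace.return_value
        self.mock_generation = self.mock_trace.generation

    def test_null_usage_values(self):
        # reset_mock() clears call counts recursively but does NOT reset
        # return_value, so the trace -> generation chain stays wired up.
        self.mock_client.reset_mock()
        self.mock_client.trace(name="t").generation(usage=None)
        self.mock_generation.assert_called_once()
```

Creating a brand-new `MagicMock()` inside the test instead would break the chain, because the code under test would still hold references to the `setUp` mocks — which is consistent with the intermittent "Called 0 times" failures.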
Greptile Overview

Greptile Summary

This PR bundles several fixes: (1) Bedrock Converse now properly handles mixed
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/llms/bedrock/chat/converse_transformation.py | Refactored json_tool_call filtering into _filter_json_mode_tools static method handling 3 scenarios (only json, mixed, none). Changed .pop("json_mode") to .get("json_mode") to prevent mutating optional_params. Well-structured with good edge case handling. |
| litellm/llms/bedrock/chat/invoke_handler.py | Added json_mode parameter to AWSEventStreamDecoder with streaming-level filtering of json_tool_call. Uses _current_tool_name state to track and suppress json_tool_call blocks, converting their content to text instead. Correctly handles start/delta/stop events. |
| litellm/llms/bedrock/chat/converse_handler.py | Simple one-line change passing json_mode to AWSEventStreamDecoder constructor, enabling streaming json_tool_call filtering. |
| tests/test_litellm/integrations/test_langfuse.py | Replaced fresh mock creation with reset_mock() on existing setUp mocks. Preserves mock chain configuration while ensuring clean state, fixing flaky CI failures. |
| tests/test_litellm/llms/bedrock/chat/test_converse_transformation.py | Added comprehensive tests for _filter_json_mode_tools (mixed tool scenario, optional_params mutation prevention) and streaming decoder (json_tool_call filtering, backward compatibility). All mock-based, no network calls. |
| tests/mcp_tests/test_mcp_chat_completions.py | Updated fake_process signatures to accept **kwargs for compatibility with updated caller signatures. Minimal change. |
| tests/test_litellm/integrations/cloudzero/test_transform.py | Corrected test expectation for resource/id field from 'test-czrn' to 'gpt-4' to match actual implementation behavior at line 144 of transform.py. |
| tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_anthropic_experimental_pass_through_messages_handler.py | Changed AsyncMock to MagicMock for httpx.Response mock since json() is synchronous. Added clarifying comments. |
| poetry.lock | Auto-regenerated lockfile. Minor version bump for litellm-proxy-extras (0.4.33 → 0.4.35). Poetry version difference (2.2.0 → 2.1.4). |
Last reviewed commit: 6a56399
why does this test modify bedrock?
|
Closing this PR - it was accidentally created from the bedrock branch instead of main, which included unrelated changes. Created a clean replacement: #21093 The new PR contains only the Langfuse test isolation fix without any bedrock or other unrelated changes. Thanks @krrishdholakia for catching this!
Problem
The `test_log_langfuse_v2_handles_null_usage_values` test was intermittently failing in CI with `Expected generation to be called once. Called 0 times.` The test passed consistently locally but failed randomly in CI, blocking builds for everyone.
Root Cause
The test was creating fresh mocks to avoid state pollution. However, this approach didn't fully isolate the test from the `setUp` method's mock configuration, leading to inconsistent behavior in CI environments where test ordering or timing might differ.

Fix
Instead of creating entirely new mocks, properly reset the existing `setUp` mocks using `.reset_mock()`. This ensures the mocks configured in `setUp` keep their chain configuration while each test starts with a clean call history.

Testing
✅ Test passes consistently
This fix improves test reliability and should prevent the flaky CI failures.
🤖 Generated with Claude Code