fix: improve Langfuse test isolation to prevent flaky failures #21073

Closed
jquinter wants to merge 7 commits into BerriAI:main from jquinter:fix/langfuse-test-flaky-mock

@jquinter
Contributor

Problem

The test_log_langfuse_v2_handles_null_usage_values test was intermittently failing in CI with:

AssertionError: Expected 'generation' to have been called once. Called 0 times.

The test passed consistently locally but failed randomly in CI, blocking builds for everyone.

Root Cause

The test was creating fresh mocks to avoid state pollution:

mock_trace = MagicMock()
mock_client = MagicMock()
self.logger.Langfuse = mock_client

However, this approach did not fully isolate the test from the mock configuration created in setUp, so its behavior could differ in CI, where test ordering or timing may not match a local run.
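
One way a fresh mock can end up unused is sketched below, under stated assumptions. If the object under test captured its client during setUp, rebinding `Langfuse` afterwards does not change which object receives the calls. `FakeLogger` and `client_factory` are hypothetical stand-ins, not the LiteLLM logger, and the PR does not confirm that this is the exact mechanism behind the CI failure.

```python
from unittest.mock import MagicMock


class FakeLogger:
    """Hypothetical stand-in for the logger under test (not LiteLLM code)."""

    def __init__(self, client_factory):
        # The client is built once, at construction time.
        self.client = client_factory()

    def log(self):
        # Calls go to the client captured in __init__.
        self.client.trace().generation()


def fresh_mock_misses_calls():
    setup_factory = MagicMock(name="setup_factory")
    logger = FakeLogger(setup_factory)  # roughly what setUp would do

    fresh_factory = MagicMock(name="fresh_factory")
    logger.Langfuse = fresh_factory  # rebinding after construction
    logger.log()

    fresh_calls = fresh_factory.return_value.trace.return_value.generation.call_count
    setup_calls = setup_factory.return_value.trace.return_value.generation.call_count
    return fresh_calls, setup_calls
```

In this sketch the fresh mock records no `generation` call while the setUp mock records one, which is the same shape as the "Called 0 times" assertion error.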

Fix

Instead of creating entirely new mocks, properly reset the existing setUp mocks using .reset_mock():

self.mock_langfuse_client.reset_mock()
self.mock_langfuse_trace.reset_mock()
self.mock_langfuse_generation.reset_mock()

This ensures:

  • Clean mock state for each test
  • Proper mock chain configuration is maintained
  • Better test isolation without losing the benefits of setUp
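
The points above can be illustrated with a minimal sketch using only `unittest.mock`. The client -> trace -> generation wiring in `build_chain` is an assumption about what setUp configures, not a copy of the test file. By default, `reset_mock()` clears call records but leaves a configured `return_value` in place.

```python
from unittest.mock import MagicMock


def build_chain():
    """Wire client -> trace -> generation (assumed wiring, for illustration)."""
    generation = MagicMock(name="generation")
    trace = MagicMock(name="trace")
    client = MagicMock(name="client")
    client.trace.return_value = trace
    trace.generation.return_value = generation
    return client, trace, generation


def reset_keeps_chain():
    client, trace, generation = build_chain()

    # Simulate an earlier test leaving call records behind.
    client.trace(name="earlier").generation(model="earlier")
    calls_before = trace.generation.call_count

    # Default reset_mock() clears calls but keeps return_value.
    for mock in (client, trace, generation):
        mock.reset_mock()

    calls_after = trace.generation.call_count
    chain_intact = (
        client.trace.return_value is trace
        and trace.generation.return_value is generation
    )
    return calls_before, calls_after, chain_intact
```

Calling `reset_keeps_chain()` shows the call count dropping back to zero while the chain still resolves to the same mock objects.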

Testing

poetry run pytest tests/test_litellm/integrations/test_langfuse.py::TestLangfuseUsageDetails::test_log_langfuse_v2_handles_null_usage_values -v

✅ Test passes consistently

This fix improves test reliability and should prevent the flaky CI failures.

🤖 Generated with Claude Code

jquinter and others added 7 commits February 12, 2026 19:12
…ls in both streaming and non-streaming

When using both `tools` and `response_format` with Bedrock Converse API, LiteLLM
internally adds a fake tool called `json_tool_call` to handle structured output.
Bedrock may return both this internal tool AND real user-defined tools, causing
consumers like OpenAI Agents SDK to break trying to dispatch `json_tool_call`.

This fix:
- Extracts `_filter_json_mode_tools()` to handle 3 scenarios: only json_tool_call
  (convert to content), mixed with real tools (filter it out), or no json_tool_call
- Fixes streaming by adding json_mode awareness to AWSEventStreamDecoder, converting
  json_tool_call chunks to text content while passing real tool chunks through
- Changes `optional_params.pop("json_mode")` to `.get()` to avoid mutating caller dict

Fixes BerriAI#18381
Credits @haggai-backline for the original investigation in PR BerriAI#18384

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
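
The `.pop()` versus `.get()` point can be shown with plain dicts. This is a minimal sketch, not the LiteLLM code: the helper names are hypothetical, and the `False` default is an assumption added so the helpers do not raise when the key is absent.

```python
def read_with_pop(optional_params: dict) -> bool:
    # Removes the key from the caller's dict as a side effect.
    return optional_params.pop("json_mode", False)


def read_with_get(optional_params: dict) -> bool:
    # Reads the value and leaves the caller's dict untouched.
    return optional_params.get("json_mode", False)


def compare_pop_and_get():
    popped = {"json_mode": True, "temperature": 0.2}
    kept = {"json_mode": True, "temperature": 0.2}
    read_with_pop(popped)
    read_with_get(kept)
    return "json_mode" in popped, "json_mode" in kept
```

After `read_with_pop` the key is gone from the caller's dict, so any later code that checks `json_mode` on the same dict would no longer see it.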
…_tool_call content in mixed case

- Move `import json` to top of converse_transformation.py per CLAUDE.md style guide
- In the mixed tools case, preserve json_tool_call arguments as message content
  so the structured output from response_format is not silently lost
- Update test to verify json_tool_call content is preserved as message text

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
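
The three scenarios described in these two commits can be sketched as follows. This is an illustration under assumptions, not the body of `_filter_json_mode_tools`: the function name, the `{"name": ..., "arguments": ...}` tool-call shape, and the return type are all hypothetical.

```python
import json
from typing import Optional

JSON_TOOL_NAME = "json_tool_call"


def filter_json_mode_tools_sketch(
    tool_calls: list[dict],
) -> tuple[Optional[str], list[dict]]:
    """Return (message_content, remaining_tool_calls) for a json_mode response."""
    internal = [t for t in tool_calls if t["name"] == JSON_TOOL_NAME]
    real = [t for t in tool_calls if t["name"] != JSON_TOOL_NAME]

    if not internal:
        # No json_tool_call present: pass everything through unchanged.
        return None, tool_calls

    # Keep the structured output as message text so it is not lost.
    content = json.dumps(internal[0]["arguments"])

    # Only json_tool_call: `real` is empty, so the result is content only.
    # Mixed: the internal tool is dropped and the real tools remain.
    return content, real
```

With this shape, a consumer that dispatches tool calls by name never sees `json_tool_call`, while the structured output is still available as message content.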
httpx.Response.json() is synchronous, not async. Using AsyncMock
made the test fail because it turned json() into a coroutine.
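
The difference is easy to reproduce with `unittest.mock` alone. In this minimal sketch the payload and function name are made up for illustration, and no real `httpx.Response` is involved.

```python
import inspect
from unittest.mock import AsyncMock, MagicMock


def compare_json_mocks():
    payload = {"id": "msg_1"}  # hypothetical response body

    # Attributes of an AsyncMock are AsyncMocks, so json() returns a coroutine.
    async_response = AsyncMock()
    async_response.json.return_value = payload
    result = async_response.json()
    returned_coroutine = inspect.iscoroutine(result)
    result.close()  # avoid a "never awaited" warning

    # A MagicMock behaves like the synchronous json() method.
    sync_response = MagicMock()
    sync_response.json.return_value = payload
    return returned_coroutine, sync_response.json()
```

Code that calls `response.json()` without `await` receives a coroutine object from the `AsyncMock` version, which is why the test failed.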

The test was expecting resource/id to be the CZRN value, but the
implementation (line 144 in transform.py) explicitly sets it to the
model name. The CZRN is used only to extract components for tags.

This was causing the test to fail with:
AssertionError: assert 'gpt-4' == 'test-czrn'

The test was creating fresh mocks but not fully isolating from setUp state,
causing intermittent CI failures with 'Expected generation to be called once.
Called 0 times.'

Instead of creating fresh mocks, properly reset the existing setUp mocks to
ensure clean state while maintaining proper mock chain configuration.

@greptile-apps
Contributor

greptile-apps bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR bundles several fixes: (1) Bedrock Converse now properly handles mixed json_tool_call + real tool responses by filtering out the internal json_tool_call in both non-streaming (_filter_json_mode_tools) and streaming (AWSEventStreamDecoder) paths, fixing #18381. (2) Changed optional_params.pop("json_mode") to .get("json_mode") to prevent mutating the caller's dict. (3) Fixed flaky Langfuse test by using reset_mock() instead of creating fresh mocks. (4) Various test fixes for MCP, CloudZero, and Anthropic pass-through mocks.

  • Extracted _filter_json_mode_tools static method in AmazonConverseConfig to handle 3 scenarios: only json_tool_call, mixed with real tools, and no json_tool_call
  • Added json_mode parameter to AWSEventStreamDecoder for streaming-level json_tool_call suppression, converting tool input to text content
  • Fixed optional_params mutation by switching from .pop() to .get() in _transform_response
  • Fixed flaky Langfuse test by using reset_mock() on existing setUp mocks instead of creating new ones
  • Updated MCP test fake_process signatures to accept **kwargs
  • Corrected CloudZero test expectation for resource/id to match actual implementation
  • Fixed Anthropic pass-through test to use MagicMock instead of AsyncMock for synchronous httpx.Response.json()
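
The streaming-level suppression summarized above can be sketched as a small state machine. This is an illustration only: the event shape (`tool_start`, `tool_delta`, `tool_stop`, `text`) and the function name are hypothetical simplifications, not the Bedrock event format or the `AWSEventStreamDecoder` API.

```python
JSON_TOOL_NAME = "json_tool_call"


def convert_stream(events: list[dict], json_mode: bool) -> list[dict]:
    """Rewrite json_tool_call tool chunks as text chunks; pass others through."""
    current_tool_name = None  # name of the tool block currently open, if any
    out = []
    for event in events:
        kind = event["type"]
        if kind == "tool_start":
            current_tool_name = event["name"]
            if json_mode and current_tool_name == JSON_TOOL_NAME:
                continue  # hide the start of the internal tool block
        elif kind == "tool_delta":
            if json_mode and current_tool_name == JSON_TOOL_NAME:
                # Emit the tool input as plain text content instead.
                out.append({"type": "text", "text": event["input"]})
                continue
        elif kind == "tool_stop":
            suppressed = json_mode and current_tool_name == JSON_TOOL_NAME
            current_tool_name = None
            if suppressed:
                continue  # hide the end of the internal tool block
        out.append(event)
    return out
```

Tracking the open tool name is what lets delta and stop events, which in this sketch carry no name of their own, be attributed to the right block. With `json_mode=False` the events pass through unchanged.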

Confidence Score: 4/5

  • This PR is safe to merge — the Bedrock json_mode refactor is well-structured with good test coverage, and the test fixes are straightforward corrections.
  • The Bedrock changes are a meaningful refactor with proper edge case handling (3 scenarios in _filter_json_mode_tools, streaming suppression logic). Tests cover the new functionality well including backward compatibility. The Langfuse and other test fixes are low-risk. Minor deduction because the Bedrock streaming changes introduce state tracking (_current_tool_name) that should be exercised in more complex multi-tool streaming scenarios.
  • Pay close attention to litellm/llms/bedrock/chat/invoke_handler.py and litellm/llms/bedrock/chat/converse_transformation.py — these contain the core Bedrock json_mode filtering logic.

Important Files Changed

Filename | Overview
litellm/llms/bedrock/chat/converse_transformation.py | Refactored json_tool_call filtering into _filter_json_mode_tools static method handling 3 scenarios (only json, mixed, none). Changed .pop("json_mode") to .get("json_mode") to prevent mutating optional_params. Well-structured with good edge case handling.
litellm/llms/bedrock/chat/invoke_handler.py | Added json_mode parameter to AWSEventStreamDecoder with streaming-level filtering of json_tool_call. Uses _current_tool_name state to track and suppress json_tool_call blocks, converting their content to text instead. Correctly handles start/delta/stop events.
litellm/llms/bedrock/chat/converse_handler.py | Simple one-line change passing json_mode to AWSEventStreamDecoder constructor, enabling streaming json_tool_call filtering.
tests/test_litellm/integrations/test_langfuse.py | Replaced fresh mock creation with reset_mock() on existing setUp mocks. Preserves mock chain configuration while ensuring clean state, fixing flaky CI failures.
tests/test_litellm/llms/bedrock/chat/test_converse_transformation.py | Added comprehensive tests for _filter_json_mode_tools (mixed tool scenario, optional_params mutation prevention) and streaming decoder (json_tool_call filtering, backward compatibility). All mock-based, no network calls.
tests/mcp_tests/test_mcp_chat_completions.py | Updated fake_process signatures to accept **kwargs for compatibility with updated caller signatures. Minimal change.
tests/test_litellm/integrations/cloudzero/test_transform.py | Corrected test expectation for resource/id field from 'test-czrn' to 'gpt-4' to match actual implementation behavior at line 144 of transform.py.
tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_anthropic_experimental_pass_through_messages_handler.py | Changed AsyncMock to MagicMock for httpx.Response mock since json() is synchronous. Added clarifying comments.
poetry.lock | Auto-regenerated lockfile. Minor version bump for litellm-proxy-extras (0.4.33 → 0.4.35). Poetry version difference (2.2.0 → 2.1.4).

Last reviewed commit: 6a56399

Contributor

@greptile-apps greptile-apps bot left a comment

9 files reviewed, no comments

Member

why does this test modify bedrock?

@jquinter
Contributor Author

Closing this PR - it was accidentally created from the bedrock branch instead of main, which included unrelated changes.

Created a clean replacement: #21093

The new PR contains only the Langfuse test isolation fix without any bedrock or other unrelated changes.

Thanks @krrishdholakia for catching this!

@jquinter jquinter closed this Feb 13, 2026