
fix(test): update reasoning_effort test to expect dict format#21271

Merged
jquinter merged 1 commit into main from
fix/anthropic-pass-through-reasoning-effort-test
Feb 15, 2026
Conversation

@jquinter
Contributor

Summary

Fixes a broken test that expected reasoning_effort to be a string; the code now returns a dict format.

Problem

Test test_openai_model_with_thinking_converts_to_reasoning_effort was failing with:

AssertionError: reasoning_effort should be 'minimal' for budget_tokens=1024, 
got {'effort': 'minimal', 'summary': 'detailed'}

The test also fails on the main branch ❌, so this is a pre-existing broken test, not a regression.

Root Cause

Code in litellm/llms/anthropic/experimental_pass_through/adapters/handler.py (lines 72-74) transforms reasoning_effort from a string to a dict:

completion_kwargs["reasoning_effort"] = {
    "effort": reasoning_effort,  # "minimal"
    "summary": "detailed",
}

The test was never updated to expect this dict format.
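The transformation above can be modeled as a standalone function. This is an illustrative sketch of the behavior described in the PR, not the actual handler code; the function name and signature are assumptions.

```python
# Simplified model of the string-to-dict transformation described above.
# The function name and the "detailed" summary default mirror the PR
# description; this sketch is illustrative, not the actual handler API.
def route_reasoning_effort(completion_kwargs: dict) -> dict:
    reasoning_effort = completion_kwargs.get("reasoning_effort")
    if isinstance(reasoning_effort, str):
        # e.g. "minimal" -> {"effort": "minimal", "summary": "detailed"}
        completion_kwargs["reasoning_effort"] = {
            "effort": reasoning_effort,
            "summary": "detailed",
        }
    return completion_kwargs

kwargs = route_reasoning_effort({"reasoning_effort": "minimal"})
print(kwargs["reasoning_effort"])  # {'effort': 'minimal', 'summary': 'detailed'}
```

Because the dict replaces the string in place, any assertion that compares against the raw string will fail exactly as shown in the error message above.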

Solution

Updated test expectations from:

assert call_kwargs["reasoning_effort"] == "minimal"

To:

expected_reasoning_effort = {"effort": "minimal", "summary": "detailed"}
assert call_kwargs["reasoning_effort"] == expected_reasoning_effort
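For context, the updated assertion style can be sketched against a mocked completion call. The mock target and setup here are illustrative, assuming the test captures call_kwargs from a patched completion function:

```python
from unittest.mock import MagicMock

# Illustrative sketch: capture the kwargs passed to a mocked completion
# function and assert on the dict-format reasoning_effort, as the
# updated test does. The model name is taken from the PR's flowchart.
mock_completion = MagicMock()
mock_completion(
    model="openai/gpt-5.2",
    reasoning_effort={"effort": "minimal", "summary": "detailed"},
)

call_kwargs = mock_completion.call_args.kwargs
expected_reasoning_effort = {"effort": "minimal", "summary": "detailed"}
assert call_kwargs["reasoning_effort"] == expected_reasoning_effort
```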

Testing

pytest tests/.../test_anthropic_experimental_pass_through_messages_handler.py::test_openai_model_with_thinking_converts_to_reasoning_effort -v
======================== 1 passed in 0.14s ========================

Related

This test failure was reported on PR #21217, but NOT caused by PR #21217 (which only modifies test_anthropic_structured_output.py). This is a pre-existing bug that also fails on the main branch.

🤖 Generated with Claude Code

Update test expectations to match the current code behavior where
reasoning_effort is transformed from a string to a dict with
'effort' and 'summary' fields.

The transformation happens in:
litellm/llms/anthropic/experimental_pass_through/adapters/handler.py:72-74

When reasoning_effort is a string like "minimal", it's converted to:
{"effort": "minimal", "summary": "detailed"}

The test was expecting just the string "minimal", causing it to fail.

Test now passes ✅

Related: test was failing on PR #21217, but NOT caused by PR #21217
(which only modifies test_anthropic_structured_output.py). This is a
pre-existing broken test that also fails on main branch.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@vercel
vercel bot commented Feb 15, 2026

litellm deployment: Ready (Preview), updated Feb 15, 2026 10:09pm

@greptile-apps
Contributor

greptile-apps bot commented Feb 15, 2026

Greptile Summary

Fixes a broken test assertion in test_openai_model_with_thinking_converts_to_reasoning_effort where the expected value for reasoning_effort was a plain string ("minimal"), but the actual code pipeline transforms it into a dict ({"effort": "minimal", "summary": "detailed"}).

  • The test exercises the full pipeline: translate_anthropic_to_openai first sets reasoning_effort as a string, then _route_openai_thinking_to_responses_api_if_needed transforms it into a dict with effort and summary fields for OpenAI models routed through the Responses API.
  • The test assertion is updated to match the actual dict output. The companion unit test (test_non_claude_model_converts_thinking_to_reasoning_effort) correctly continues to expect a string since it tests translate_thinking_for_model in isolation.
  • The PR includes evidence of the fix: passing test output is provided in the description.
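The two-stage pipeline described above can be modeled roughly as follows. The function names are simplified stand-ins for the actual handler internals, and the budget-to-effort threshold is an illustrative guess, not confirmed behavior:

```python
# Stage 1: the Anthropic->OpenAI translation maps a thinking budget to a
# string effort level (the 1024-token threshold here is an assumption).
def translate_thinking(thinking: dict) -> str:
    budget = thinking.get("budget_tokens", 0)
    return "minimal" if budget <= 1024 else "low"

# Stage 2: routing to the Responses API wraps the string in a dict,
# matching the handler transformation quoted in the PR description.
def route_to_responses_api(effort: str) -> dict:
    return {"effort": effort, "summary": "detailed"}

effort = translate_thinking({"type": "enabled", "budget_tokens": 1024})
result = route_to_responses_api(effort)
assert result == {"effort": "minimal", "summary": "detailed"}
```

This also shows why the companion unit test keeps its string expectation: it exercises only stage 1 in isolation, while the failing test exercised both stages.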

Confidence Score: 5/5

  • This PR is safe to merge — it only fixes a test assertion to match existing production behavior.
  • The change is a minimal, correct test fix. It aligns the test expectation with the actual code behavior in the handler pipeline. No production code is modified, and the fix has been verified with a passing test run.
  • No files require special attention.

Important Files Changed

Filename: tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_anthropic_experimental_pass_through_messages_handler.py
Overview: Updates the test assertion for reasoning_effort from expecting a string to expecting the dict format ({"effort": "minimal", "summary": "detailed"}) that the full pipeline actually produces. A correct fix for a pre-existing broken test.

Flowchart

flowchart TD
    A["anthropic_messages_handler\n(model: openai/gpt-5.2, thinking: {type: enabled, budget_tokens: 1024})"] --> B["_prepare_completion_kwargs"]
    B --> C["translate_anthropic_to_openai\n(sets reasoning_effort = 'minimal')"]
    C --> D["_route_openai_thinking_to_responses_api_if_needed\n(transforms to dict for OpenAI models)"]
    D --> E["reasoning_effort = {effort: 'minimal', summary: 'detailed'}"]
    E --> F["litellm.completion(**completion_kwargs)"]

Last reviewed commit: 0812323


@greptile-apps bot left a comment


1 file reviewed, no comments


@jquinter jquinter merged commit f20dd25 into main Feb 15, 2026
17 of 23 checks passed
jquinter added a commit that referenced this pull request Feb 15, 2026
…ution

Implements three key improvements to reduce test flakiness from parallel execution:

1. **Split Vertex AI tests into separate group** (workers: 1)
   - Vertex AI tests often have environment variable pollution issues
   - Running serially prevents cross-test interference with GOOGLE_APPLICATION_CREDENTIALS
   - Isolates authentication-related test failures

2. **Reduce workers for other LLM tests** (4 -> 2)
   - Decreases chance of race conditions and state conflicts
   - Still parallel but with less contention

3. **Add --dist=loadscope to pytest-xdist**
   - Keeps tests from the same file together on one worker
   - Reduces interference between unrelated test modules
   - Data shows 70% pass rate WITH loadscope vs 40% WITHOUT
   - Better test isolation while maintaining parallelism

Note: loadscope exposes one tokenizer cache issue in core-utils which will be
fixed in a separate PR. The tradeoff is worth it (7/10 pass vs 4/10 without).

These changes address the root causes of intermittent test failures in:
PRs #21268, #21271, #21272, #21273, #21275, #21276:
- Environment variable pollution (GOOGLE_APPLICATION_CREDENTIALS, VERTEXAI_PROJECT)
- Global state conflicts (litellm.known_tokenizer_config)
- Async mock timing issues with parallel execution

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
jquinter added a commit that referenced this pull request Feb 18, 2026
…ution

(Commit message identical to the Feb 15 commit above.)