
Fix websearch interception with extended thinking mode support #20488

Open
mpcusack-altos wants to merge 1 commit into BerriAI:main from mpcusack-altos:mcusack/05022026/websearch-thinking-blocks

Conversation


@mpcusack-altos mpcusack-altos commented Feb 5, 2026

Motivation

When Anthropic's extended thinking is enabled, assistant messages must start with thinking blocks before tool_use blocks. The agentic loop was creating follow-up messages with only tool_use blocks, causing validation errors. This change ensures thinking blocks from the original response are preserved and included at the start of follow-up assistant messages.
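
For illustration, here is the message shape this ordering rule requires (a sketch; the block contents are invented, but the ordering constraint is the one described above):

```python
# Hypothetical follow-up assistant message when extended thinking is enabled.
# The thinking block must come before any tool_use blocks, or the API rejects
# the request with a 400 validation error.
valid_assistant_message = {
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "I should search the web...", "signature": "..."},
        {"type": "tool_use", "id": "toolu_01", "name": "litellm_web_search",
         "input": {"query": "example query"}},
    ],
}

# What the agentic loop was producing before this fix: tool_use first.
invalid_assistant_message = {
    "role": "assistant",
    "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "litellm_web_search",
         "input": {"query": "example query"}},
    ],
}
```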

Implementation

  • Created TransformRequestResult NamedTuple to capture both tool_calls and thinking_blocks from transform_request(), making the contract explicit and extensible
  • Modified transform_request() to extract and return thinking/redacted_thinking blocks alongside tool calls
  • Updated transform_response() to accept thinking_blocks and prepend them to follow-up assistant messages
  • Passed thinking_blocks through the agentic loop chain: detection → execution → message transformation
  • Fixed transform_request() to return full kwargs (not just tools) to preserve other request parameters
  • Used filter_internal_params() utility instead of manual filtering for consistency
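
A minimal sketch of what the `TransformRequestResult` contract might look like (field names beyond `tool_calls` and `thinking_blocks` are assumptions based on this description, not the actual LiteLLM code):

```python
from typing import Any, Dict, List, NamedTuple

class TransformRequestResult(NamedTuple):
    """Explicit contract for transform_request(): tool calls to execute plus
    any thinking/redacted_thinking blocks to carry into the follow-up turn."""
    has_websearch: bool
    tool_calls: List[Dict[str, Any]]
    thinking_blocks: List[Dict[str, Any]]

result = TransformRequestResult(
    has_websearch=True,
    tool_calls=[{"id": "toolu_01", "name": "litellm_web_search",
                 "input": {"query": "example"}}],
    thinking_blocks=[{"type": "thinking", "thinking": "..."}],
)
# Fields are accessed by name, so a new field can be appended later
# (extensibility) without breaking existing callers.
```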

Additional Context

This change fixes websearch interception when extended thinking mode is enabled.

Problem: When Anthropic's extended thinking is enabled, assistant messages must start with thinking blocks before tool_use blocks. The agentic loop was creating follow-up messages with only tool_use blocks, causing the error: messages.1.content.0.type: Expected 'thinking' or 'redacted_thinking', but found 'tool_use'

Solution: Modified transform_request() to capture thinking/redacted_thinking blocks from the original response, and transform_response() to include them at the start of the assistant message in follow-up requests.

Testing: Successfully tested end-to-end with Claude Code → LiteLLM Proxy → AWS Bedrock → Claude Opus 4.5.

Configuration Example

```yaml
model_list:
  - model_name: claude-opus-4-5-20251101
    litellm_params:
      model: bedrock/us.anthropic.claude-opus-4-5-20251101-v1:0
      aws_region_name: us-west-2
    model_info:
      supports_web_search: true
litellm_settings:
  callbacks: ["websearch_interception"]
  websearch_interception_params:
    enabled_providers: ["bedrock"]
    search_tool_name: "searxng-search"
search_tools:
  - search_tool_name: searxng-search
    litellm_params:
      search_provider: searxng
      api_base: "https://searxng.example.com"
```

Note: Uses bedrock/ (not bedrock/converse/) to route through anthropic_messages_handler() which supports agentic hooks.


vercel bot commented Feb 5, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| litellm | Ready | Preview, Comment | Feb 18, 2026 10:15pm |


greptile-apps bot commented Feb 5, 2026

Greptile Overview

Greptile Summary

This PR fixes websearch interception when Anthropic's extended thinking mode is enabled by capturing and preserving thinking blocks throughout the agentic loop.

Key Changes:

  • Introduced TransformRequestResult NamedTuple to explicitly return tool calls AND thinking blocks from transform_request()
  • Modified transform_response() to prepend thinking blocks to assistant messages, satisfying Anthropic's requirement that thinking blocks must appear before tool_use blocks
  • Threaded thinking_blocks through the entire agentic loop chain (detection → execution → message transformation)
  • Added thinking block normalization from SDK objects to dicts to ensure JSON serialization works correctly
  • Refactored filter_internal_params() to use prefix-based filtering for _websearch_interception* parameters
  • Changed async_pre_call_deployment_hook to return full kwargs instead of just tools dict
  • Router now stores custom_llm_provider in deployment.litellm_params for callback access

Code Quality Notes:

  • Contains duplicate tool conversion logic in both async_pre_call_deployment_hook and async_pre_request_hook that could be consolidated
  • Comprehensive test coverage added for new functionality including thinking block normalization

The implementation correctly addresses the validation error described in the PR description.

Confidence Score: 4/5

  • Safe to merge with minor code quality improvements recommended
  • The core logic for capturing and threading thinking blocks is sound and addresses the stated problem. The thinking block normalization from SDK objects to dicts is implemented correctly. However, there's code duplication in the handler that could be refactored, and the change to return full kwargs in the deployment hook could have unintended side effects if not carefully tested.
  • Pay attention to handler.py which contains duplicate tool conversion logic and returns full kwargs from deployment hook

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/integrations/websearch_interception/transformation.py | Introduces the TransformRequestResult NamedTuple and adds thinking block capture/normalization logic to handle extended thinking mode requirements |
| litellm/integrations/websearch_interception/handler.py | Threads thinking_blocks through the agentic loop chain and changes the hook return to spread all kwargs; contains duplicate tool conversion logic across two hooks |
| litellm/litellm_core_utils/core_helpers.py | Adds the INTERNAL_PARAMS_PREFIXES constant and refactors filter_internal_params() to use prefix-based filtering |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Claude Code
    participant Router as LiteLLM Router
    participant Hook as WebSearchInterception
    participant Transform as WebSearchTransformation
    participant Provider as AWS Bedrock
    participant Search as litellm.asearch()

    Client->>Router: completion(model, messages, tools=[web_search])
    Router->>Hook: async_pre_call_deployment_hook(kwargs)
    Hook->>Hook: Convert native web_search to litellm_web_search tool
    Hook-->>Router: Return modified kwargs with converted tools
    Router->>Provider: API request with litellm_web_search tool
    Provider-->>Router: Response with thinking blocks + tool_use blocks
    Router->>Hook: async_should_run_agentic_loop(response)
    Hook->>Transform: transform_request(response)
    Transform->>Transform: Extract tool_calls and thinking_blocks from response
    Transform-->>Hook: TransformRequestResult(has_websearch=True, tool_calls, thinking_blocks)
    Hook-->>Router: Return (True, tools_dict)
    Router->>Hook: async_run_agentic_loop(tools, messages, kwargs)
    Hook->>Search: Execute searches in parallel via litellm.asearch()
    Search-->>Hook: Return search results
    Hook->>Transform: transform_response(tool_calls, results, thinking_blocks)
    Note over Transform: Builds assistant message with thinking blocks first,<br/>then tool_use blocks (Anthropic requirement)
    Transform-->>Hook: Return (assistant_message, user_message)
    Hook->>Provider: Follow-up request with messages + search results
    Provider-->>Hook: Final response with answer
    Hook-->>Router: Return final response
    Router-->>Client: Final response
```

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 4 comments


greptile-apps bot commented Feb 5, 2026

Additional Comments (1)

litellm/integrations/websearch_interception/transformation.py
Mismatched tool/result lengths

transform_response() indexes search_results[i] for i in range(len(tool_calls)). If any tool call didn’t produce a result (e.g., earlier code adds an empty placeholder for missing query, or future callers pass fewer results), this will raise IndexError and break the agentic loop. Consider iterating over zip(tool_calls, search_results) (and/or asserting lengths match) to avoid runtime crashes.
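
A sketch of the suggested fix (the function name and the tool-call/result shapes are assumptions for illustration, not the actual LiteLLM code):

```python
from typing import Any, Dict, List

def build_tool_results(
    tool_calls: List[Dict[str, Any]],
    search_results: List[str],
) -> List[Dict[str, Any]]:
    # Pad missing results instead of indexing search_results[i] directly,
    # which raises IndexError when fewer results than tool calls exist.
    padded = search_results + [""] * (len(tool_calls) - len(search_results))
    return [
        {"type": "tool_result", "tool_use_id": call.get("id"), "content": result}
        for call, result in zip(tool_calls, padded)
    ]
```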


@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from ff442f5 to 9b4b15c Compare February 5, 2026 12:19
@mpcusack-altos mpcusack-altos marked this pull request as draft February 5, 2026 12:20
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from 9b4b15c to b3e5f72 Compare February 5, 2026 12:34
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from b3e5f72 to e2f0969 Compare February 5, 2026 12:50
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from e2f0969 to dc333f1 Compare February 5, 2026 12:54
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from dc333f1 to a6c5882 Compare February 5, 2026 13:00
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from a6c5882 to cb1ff68 Compare February 5, 2026 13:01
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from cb1ff68 to 0bb8498 Compare February 5, 2026 13:03
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch 2 times, most recently from 719508b to 74d8fd6 Compare February 5, 2026 13:06
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch 2 times, most recently from 28d335a to 7b541dd Compare February 5, 2026 13:09
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from 7b541dd to 1abc412 Compare February 5, 2026 13:13
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from 1abc412 to c8dd971 Compare February 5, 2026 13:19
@mpcusack-altos
Contributor Author

@greptileai

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments

@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch from 5b2f59d to 1715711 Compare February 5, 2026 14:15
@mpcusack-altos mpcusack-altos force-pushed the mcusack/05022026/websearch-thinking-blocks branch 3 times, most recently from 69b73f5 to 5ab5727 Compare February 5, 2026 14:24

jquinter commented Feb 5, 2026

Hey @mpcusack-altos, solid PR — the thinking block fix is well-architected and the TransformRequestResult NamedTuple is a nice improvement. A few things to address:

1. Normalize thinking blocks to dicts when capturing

The blocks captured from response content could be Anthropic SDK response objects (e.g., ThinkingBlock(type='thinking', thinking='...')) rather than plain dicts, depending on the response format. They're later placed directly into the follow-up assistant message content:

assistant_content.extend(thinking_blocks)

If the blocks are SDK objects, this could fail when anthropic_messages.acreate() tries to serialize them for the next request. Consider normalizing to dicts when capturing in _detect_from_non_streaming_response:

```python
if block_type in ("thinking", "redacted_thinking"):
    if isinstance(block, dict):
        thinking_blocks.append(block)
    else:
        # Normalize SDK objects to dicts for safe serialization in follow-up requests
        normalized = {"type": block_type}
        for attr in ("thinking", "data", "signature"):
            if hasattr(block, attr):
                normalized[attr] = getattr(block, attr)
        thinking_blocks.append(normalized)
```

2. Good catch on async_pre_call_deployment_hook

Worth calling out explicitly: the change from return {"tools": converted_tools} to return {**kwargs, "tools": converted_tools} also fixes a separate bug. The return value of this hook replaces all kwargs in litellm/utils.py:1257-1258, so the old code was silently stripping everything except tools. Nice find.

3. Missing blank line in core_helpers.py

_is_param_internal (line 401) is missing the PEP 8 blank line before a top-level function definition — this will likely fail the linter.

4. Tests required

The project requires at least 1 test in tests/litellm/. The TransformRequestResult extraction and transform_response with thinking blocks are easily unit-testable with mock content blocks.

5. Conflicts with PR #20489

Heads up — PR #20489 also modifies _execute_agentic_loop in the same file (adding max_tokens validation and thinking param conditional dropping). Your changes are complementary but will conflict on merge. Might be worth coordinating merge order.


Side note: Greptile flagged "TypeError from startswith(set)" and "membership test against None" as critical bugs in core_helpers.py. Both are false positives — your code correctly uses any() iteration and and short-circuiting respectively.

@mpcusack-altos
Contributor Author

@greptileai

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments


greptile-apps bot commented Feb 5, 2026

Additional Comments (1)

litellm/integrations/websearch_interception/handler.py
Duplicate logic in two hooks - async_pre_call_deployment_hook (lines 68-115) and async_pre_request_hook (lines 163-238) perform nearly identical tool conversion. Both iterate through tools and convert web search tools to LiteLLM standard format. Consider consolidating this logic into a shared helper method to reduce duplication and maintenance burden.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!



mpcusack-altos commented Feb 5, 2026

@jquinter Thanks for the review.

1-4) ✔️
5) I will reach out to that author.

CI linting/tests are currently failing due to merge-conflict cruft upstream, but they pass locally.


Quentin-M commented Feb 5, 2026

That's amazing!

Thank you so much for putting together such a comprehensive fix. Your PR is vastly superior to my poor attempt (#20489), so I will yield to your change and address the API key loading & budget constraints on top of your PR separately.


Quentin-M commented Feb 6, 2026

Unfortunately, there are still a few issues despite the fix, e.g. running your fix, alongside Bedrock + Sonnet 4.5:

More specifically: When 'thinking' is enabled, a final 'assistant' message must start with a thinking block:

search the web for bitmex's february launches - and write that up in a nice pdf using md2pdf!

```
⎿ API Error: 400 {"error":{"message":"{\"message\":\"messages.5.content.0.type: Expected `thinking` or `redacted_thinking`, but found `text`. When `thinking` is enabled, a final `assistant` message must start with a thinking block (preceeding the lastmost set of `tool_use` and `tool_result` blocks). We recommend you include thinking blocks from previous turns. To avoid this requirement, disable `thinking`. Please consult our documentation at https://docs.claude.com/en/docs/build-with-claude/extended-thinking\"}. Received Model Group=claude-sonnet-4-5-20250929\nAvailable Model Group Fallbacks=None","type":"None","param":"None","code":"400"}}
```

My Claude put together an attempt at a fix, but it's mostly wrong, as it (and I) do not understand the APIs & flows correctly.


ghost commented Feb 6, 2026

@mpcusack-altos can you address the feedback from @Quentin-M so it can actually work for claude code?

would love to just have 1 pr which fixes this

cc: @jquinter for monitoring

@mpcusack-altos
Contributor Author

The PR was working for me yesterday against Opus 4.5. I'll stress-test it with Opus and Sonnet some more today and see if I can reproduce @Quentin-M's error. I think I can follow what the error message is asking us to do, but it's tough to be sure if I can't reproduce it.

@Quentin-M were you able to use web search at all? Did it fail in a fresh session or only in the middle of a chat? Does it work for you with Opus 4.5? What version of Claude Code?

@Quentin-M
Contributor

@mpcusack-altos Sorry for the late reply - I had to make a last minute trip after losing a family member.

Did it fail in a fresh session or only in the middle of a chat

Interesting question. I was starting fresh sessions to make a basic demo video.

Does it work for you with opus 4-5?

Actually trying to reproduce again right now with Opus 4.6 1M & Opus 4.5 - and web search seems to work well - even when asking Claude Code to think hard.

With that being said, we run LiteLLM with the following PR: https://github.com/BerriAI/litellm/pull/20489/commits, which carries various 'fixes' (all Claude generated without review) including fix(thinking): drop thinking param when assistant messages have text without thinking blocks as well as a few other thinking & beta headers fixes.

The logic might be flawed (I am no expert, and Claude certainly isn't one either), but that has allowed our various teams to work over the past few days. I have not heard complaints, and can't check the central logs, as we have disabled them for LiteLLM specifically due to LiteLLM leaking prompts/responses in a few places of the code despite having that generally disabled.

@Quentin-M
Contributor

I've rebased my PR on your latest changes. My branch builds on top of yours, and fixes many more issues in order to make websearch / thinking work relatively well with claude code and the latest models.


ghost commented Feb 18, 2026

@greptile please review this PR


mpcusack-altos commented Feb 19, 2026

I tested this some more today with Sonnet/Opus 4.5/4.6, and this PR as-is is working for me. I haven't been able to reproduce the issues @Quentin-M is running into for some reason, although looking at his code I agree that most or all of them are issues this code could hit.

Claude summary of those changes:

  1. max_tokens auto-adjust for budget_tokens (handler.py): When thinking is enabled, the follow-up request in the agentic loop can fail with Anthropic's "max_tokens must be > budget_tokens" error. The PR adds max_tokens = budget_tokens + DEFAULT_MAX_TOKENS when this condition is hit.
  2. kwargs_for_followup dedup filtering (handler.py): Prevents "got multiple values for keyword argument" errors (e.g., context_management being in both optional_params and kwargs). Without this, follow-up requests can crash.
  3. context_management stripping (3 Bedrock transformation files): Bedrock doesn't support context_management as a body param. Without this, requests with context management enabled will fail.
  4. Adaptive thinking type (base_llm/chat/transformation.py): Opus 4.6 uses "adaptive" thinking. Without this, is_thinking_enabled() returns false for Opus 4.6, breaking thinking block handling.
  5. Expanded thinking param drop (anthropic/chat/transformation.py, converse_transformation.py): Drops the thinking param when the last assistant message has text (not just tool_calls) without thinking blocks. Prevents "Expected thinking or redacted_thinking, but found text" errors.
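
As an illustration of item 1, a guard like the following (all names here are assumptions, including DEFAULT_MAX_TOKENS) would enforce Anthropic's max_tokens > budget_tokens requirement:

```python
from typing import Optional

DEFAULT_MAX_TOKENS = 4096  # assumed headroom constant, per the summary above

def adjust_max_tokens(max_tokens: Optional[int],
                      budget_tokens: Optional[int]) -> Optional[int]:
    # Anthropic requires max_tokens > budget_tokens when thinking is enabled;
    # bump max_tokens with headroom when the caller's value is too small.
    if budget_tokens is not None and (max_tokens is None or max_tokens <= budget_tokens):
        return budget_tokens + DEFAULT_MAX_TOKENS
    return max_tokens
```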

This, combined with the thinking support in this PR, results in complex and somewhat brittle code in the web search interceptor that really has nothing to do with web search, and I think it will have us chasing our tails with every new Claude Code/model update.

I feel like it's an indication that the async_run_agentic_loop callback is the wrong level of abstraction to use, as it requires us to craft the follow-up request.

@krrishdholakia @jquinter You know litellm way better than I do, might there be a better hook the interceptor could use? Or some common code path it could delegate all the thinking block manipulation to?
