Support ImageContent and AudioContent in sampling handlers #3550
Co-authored-by: Claude <noreply@anthropic.com>
Test Failure Analysis

Summary: two failures — one real bug and one pre-existing flaky test.

Suggested solution: for the real bug (#1), bump the minimum `anthropic` version:

```toml
# pyproject.toml
anthropic = ["anthropic>=0.48.0"]
```

The integration test timeout (#2) is a pre-existing flaky test; no action required from this PR.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15196dabe6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
# When there are tool calls, content must be text-only
text_str = "\n".join(text_parts) or None
openai_messages.append(
    ChatCompletionAssistantMessageParam(
        role="assistant",
```
Reject assistant image/audio parts when tool_calls are present
If an assistant turn contains ToolUseContent plus ImageContent/AudioContent, this branch only serializes text_parts and never validates or forwards the non-text content_parts. The non-tool_calls path already raises for assistant multimodal content, and OpenAI's chat-completions assistant messages only accept text/refusal parts, so this now silently drops image/audio context instead of failing fast. That will change the conversation state for any mixed assistant turn and is very hard to diagnose.
Good catch — fixed in aa76c47. Both the tool_calls and non-tool_calls branches now raise ValueError for image/audio in assistant messages.
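A minimal sketch of the validation described above, using illustrative stand-in types (not the actual FastMCP classes or converter code):

```python
from dataclasses import dataclass


@dataclass
class TextContent:
    text: str


@dataclass
class ImageContent:
    data: str  # base64-encoded payload
    mimeType: str


@dataclass
class AudioContent:
    data: str
    mimeType: str


def collect_assistant_text(parts: list) -> list[str]:
    """Collect text for an assistant turn, failing fast on media parts.

    The same check runs in both the tool_calls and non-tool_calls branches:
    OpenAI chat-completions assistant messages accept only text/refusal
    parts, so image/audio must raise instead of being silently dropped.
    """
    text_parts: list[str] = []
    for part in parts:
        if isinstance(part, (ImageContent, AudioContent)):
            raise ValueError(
                f"{type(part).__name__} is not supported in assistant "
                "messages for OpenAI"
            )
        if isinstance(part, TextContent):
            text_parts.append(part.text)
    return text_parts
```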
```python
if isinstance(content, ImageContent):
    return Part(
        inline_data=Blob(
            data=base64.b64decode(content.data),
            mime_type=content.mimeType,
```
Validate Gemini media MIME types before building inline_data
Gemini only accepts a fixed allowlist of image and audio MIME types, but this new conversion path passes content.mimeType straight through without any checks (the audio branch immediately below does the same). Inputs like image/bmp or common mobile recordings such as audio/m4a will now get all the way to generate_content() and fail with provider-side 400s instead of the early, clear ValueError the other handlers return for unsupported media.
Gemini supports a much wider range of media types than OpenAI/Anthropic — a restrictive allowlist would reject valid inputs. The API validates and returns a clear error for unsupported types.
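For illustration, the passthrough behavior looks roughly like this (plain-dict stand-ins for the SDK's `Part`/`Blob` types; the real code uses `google.genai` objects):

```python
import base64


def image_content_to_gemini_part(data_b64: str, mime_type: str) -> dict:
    """Decode the base64 payload and pass the MIME type through unchanged.

    No client-side allowlist: Gemini validates the MIME type server-side
    and returns its own error for unsupported media.
    """
    return {
        "inline_data": {
            "data": base64.b64decode(data_b64),
            "mime_type": mime_type,
        }
    }
```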
Force-pushed 15196da to db9c1e2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa76c47704
```python
if isinstance(content, ImageContent):
    anthropic_messages.append(
        MessageParam(
            role=message.role,
            content=[_image_content_to_anthropic_block(content)],
```
Reject assistant-side image blocks before calling Anthropic
_convert_to_anthropic_messages() now forwards ImageContent with role=message.role, but Anthropic's Messages API only accepts image blocks in user turns; assistant turns are limited to model text/tool-use content. If a sampling history includes an assistant ImageContent (or a list-content assistant turn with an image), this code will build an invalid payload and the next request fails with a provider-side 400 instead of a local validation error.
Good catch — fixed in 2298e76. Both single-content and list-content paths now validate that ImageContent is only in user messages.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2298e7639a
```python
elif isinstance(item, ImageContent):
    content_parts.append(_image_content_to_openai_part(item))
elif isinstance(item, AudioContent):
    content_parts.append(_audio_content_to_openai_part(item))
```
Reject non-text media in OpenAI tool results
This change makes ImageContent/AudioContent valid elsewhere in _convert_to_openai_messages(), but both ToolResultContent branches still build role="tool" messages from TextContent only. When an MCP tool returns content=[ImageContent(...)] or AudioContent(...), the payload is silently stripped and OpenAI receives an empty tool result, so the model never sees the screenshot/audio it asked for. OpenAI's chat-completions docs say tool messages only support text parts, so this path should fail fast instead of dropping the media.
Pre-existing limitation — ToolResultContent has always only extracted TextContent. This PR adds multimodal support for messages, not tool results. Valid follow-up but out of scope here.
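A simplified sketch of that pre-existing limitation (stand-in dataclasses, not the real converter): only `TextContent` survives into the `role="tool"` message, so media parts vanish silently:

```python
from dataclasses import dataclass


@dataclass
class TextContent:
    text: str


@dataclass
class ImageContent:
    data: str
    mimeType: str


def tool_result_to_openai_text(parts: list) -> str:
    """Current behavior: extract only text; any media part is dropped."""
    return "\n".join(p.text for p in parts if isinstance(p, TextContent))
```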
```python
elif isinstance(item, ImageContent):
    if message.role != "user":
        raise ValueError(
            "ImageContent is only supported in user messages "
            "for Anthropic"
```
Preserve images inside Anthropic tool_result blocks
Anthropic's tool-use API allows tool_result.content to contain nested text or image blocks, but _convert_to_anthropic_messages() still serializes ToolResultContent.content by collecting only TextContent. If a FastMCP tool returns content=[ImageContent(...)], Claude will receive an empty tool result instead of the image it requested. Since this commit adds ImageContent support for user messages, the same media needs to be forwarded (or explicitly rejected) in ToolResultContent as well.
Same — pre-existing, unchanged by this PR. Anthropic does support images in tool_result blocks so it's a valid follow-up.
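As a sketch of that follow-up, a tool_result could carry nested blocks in the dict shape Anthropic's Messages API accepts (illustrative code, not part of this PR; stand-in dataclasses for the MCP content types):

```python
from dataclasses import dataclass


@dataclass
class TextContent:
    text: str


@dataclass
class ImageContent:
    data: str  # base64-encoded
    mimeType: str


def tool_result_content_blocks(parts: list) -> list[dict]:
    """Forward both text and image parts as nested tool_result blocks."""
    blocks: list[dict] = []
    for p in parts:
        if isinstance(p, TextContent):
            blocks.append({"type": "text", "text": p.text})
        elif isinstance(p, ImageContent):
            blocks.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": p.mimeType,
                    "data": p.data,
                },
            })
    return blocks
```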
```python
if isinstance(content, ImageContent):
    return Part(
        inline_data=Blob(
            data=base64.b64decode(content.data),
            mime_type=content.mimeType,
```
Handle media blocks in Gemini function responses too
The new ImageContent/AudioContent branches only cover top-level message parts. _sampling_content_to_google_genai_part() still raises from its ToolResultContent branch on any non-TextContent, so a tool returning ImageContent or AudioContent will still fail during sampling. Gemini's FunctionResponsePart supports inlineData blobs, so this leaves the new multimodal support incomplete for the common tool-result path.
Same — pre-existing, unchanged by this PR. Multimodal tool results are a valid follow-up.
The MCP specification explicitly includes `ImageContent` and `AudioContent` in sampling messages — a server can send an image to the client and ask the LLM to analyze it via `sampling/createMessage`. But all three sampling handlers (OpenAI, Anthropic, Google GenAI) only handled `TextContent`, silently dropping multimodal content in list contexts or raising opaque errors for single content.

Each handler now converts these content types to its provider's native format, with MIME type validation that fails early with clear messages rather than letting the provider API reject them later. Audio support varies by provider — OpenAI supports wav/mp3, Google GenAI accepts any audio MIME type, and Anthropic doesn't support audio at all (raises `ValueError` immediately).

Closes #3544
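For reference, the wire-level shape of a `sampling/createMessage` request carrying an image, per the MCP specification (plain dicts for illustration; field names follow the spec, the helper itself is hypothetical):

```python
import base64


def make_sampling_request(image_bytes: bytes, prompt: str) -> dict:
    """Build the params for a sampling/createMessage request with an image."""
    return {
        "messages": [
            {
                "role": "user",
                "content": {
                    "type": "image",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                    "mimeType": "image/png",
                },
            },
            {
                "role": "user",
                "content": {"type": "text", "text": prompt},
            },
        ],
        "maxTokens": 512,
    }
```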