Add vision/image analysis support to worker prompts #149
When users send images via Telegram/WhatsApp, the worker now reads them as base64 and includes them as ImageContent blocks in the prompt sent to the LLM, enabling vision-capable models to actually see and analyze uploaded images instead of just knowing the file path. https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
The installed Pi SDK version uses prompt(text, {images}) not
multi-modal content arrays. Fixed to pass ImageContent[] via
the images option parameter.
https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0531ae8081
    // Build multi-modal prompt when images are present
    const imageBlocks = await this.getImageContentBlocks();
    if (imageBlocks.length > 0) {
Gate image prompts on model vision capability
This branch sends a multimodal payload whenever any uploaded file is classified as an image, but there is no check that the selected provider/model can accept image content. In sessions using text-only models, session.prompt(...) will fail on these image blocks and the whole request errors out, whereas the previous behavior still allowed processing via local file tools. Please gate this path on model capabilities (or catch/retry with text-only prompt) so image uploads don’t hard-fail non-vision models.
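One way to address this comment is sketched below: skip the image path when the model is known to be text-only, and retry with a text-only prompt if the multimodal call fails. The `PromptSession` interface, the `modelSupportsVision` flag, and the `promptWithImageFallback` name are hypothetical stand-ins for illustration, and the `ImageContent` shape is an assumption about the SDK type; none of this is taken from the PR's code.

```typescript
// Assumed shape of an image block; the real type comes from the SDK.
interface ImageContent {
  type: "image";
  data: string; // base64-encoded bytes
  mimeType: string;
}

// Hypothetical minimal session interface, modelled on prompt(text, {images}).
interface PromptSession {
  prompt(text: string, options?: { images?: ImageContent[] }): Promise<void>;
}

// Sends images only when the model is expected to accept them, and falls
// back to a text-only prompt if the multimodal request is rejected.
async function promptWithImageFallback(
  session: PromptSession,
  text: string,
  images: ImageContent[],
  modelSupportsVision: boolean,
): Promise<"with-images" | "text-only"> {
  if (images.length === 0 || !modelSupportsVision) {
    await session.prompt(text);
    return "text-only";
  }
  try {
    await session.prompt(text, { images });
    return "with-images";
  } catch {
    // The failure may be unrelated to vision support; a real implementation
    // would likely inspect the error before retrying.
    await session.prompt(text);
    return "text-only";
  }
}
```

With this shape, a text-only model still receives the prompt and can reach the uploaded files through local file tools, as it did before the change.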
    const data = await fs.readFile(filePath);
    const base64Data = data.toString("base64");
Bound image file size before base64 encoding
Each image is loaded fully into memory and then expanded to base64 with no per-file or total-size cap. Large uploads (or multiple medium images) can create very large in-memory payloads and oversized model requests, causing OOM/restarts or upstream request rejection. Add a size budget (and fallback behavior) before calling readFile/toString("base64") to keep worker memory and prompt size bounded.
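A minimal sketch of such a budget follows, assuming file sizes are obtained first (for example via a stat call) so that oversized files are never read into memory. The `SizedFile` type, the `selectWithinBudget` name, and the limit parameters are hypothetical; the only fixed fact used is that base64 encodes every 3 input bytes as 4 output characters.

```typescript
// Hypothetical record: a file path plus its size in bytes, known before reading.
interface SizedFile {
  path: string;
  sizeBytes: number;
}

// Length of the padded base64 encoding of `rawBytes` bytes.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Accepts files in order while each stays under the per-file cap and the
// running base64 total stays under the overall budget; the rest are skipped.
function selectWithinBudget(
  files: SizedFile[],
  maxFileBytes: number,
  maxTotalBase64Chars: number,
): { accepted: SizedFile[]; skipped: SizedFile[] } {
  const accepted: SizedFile[] = [];
  const skipped: SizedFile[] = [];
  let used = 0;
  for (const file of files) {
    const encoded = base64Length(file.sizeBytes);
    if (file.sizeBytes > maxFileBytes || used + encoded > maxTotalBase64Chars) {
      skipped.push(file);
      continue;
    }
    used += encoded;
    accepted.push(file);
  }
  return { accepted, skipped };
}
```

Skipped files can still be mentioned in the prompt by path, so the model can fall back to local file tools for them.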
- Use mimetype.startsWith("image/") instead of hardcoded MIME set
- Pi SDK handles provider-specific vision support transparently
- Works for all providers and all platforms (Telegram, WhatsApp, Slack)
- Consolidate and clean up instructions
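The prefix check from the first bullet might look like the sketch below. The `isImage` name and the `mimetype` field follow the wording used in this PR, but the parameter type is an assumption made for illustration.

```typescript
// Treats any file whose MIME type starts with "image/" as an image.
// MIME types are case-insensitive, so the value is lowercased first.
function isImage(file: { mimetype?: string }): boolean {
  return file.mimetype?.toLowerCase().startsWith("image/") ?? false;
}
```

Compared with a fixed set of MIME types, this also accepts formats such as `image/heic` or `image/svg+xml`, which a given model may or may not be able to read.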
https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
- Extract shared `uploadedFiles` getter and `isImage()` helper
- Remove duplicated file access across 3 methods
- Remove unnecessary conditional branch (always pass images option)
- Add 20MB size guard to skip oversized images
- Rename to `loadImageAttachments()` for clarity
- Remove `(f: any)` casts where typed getter is used
https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
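As a rough illustration of how a typed getter can replace `(f: any)` casts, the sketch below narrows an untyped payload once and lets every caller work with a typed array. `WorkerSketch`, `UploadedFile`, `isUploadedFile`, and `imageNames` are hypothetical names; the real worker's fields and payload layout are not shown in this PR page.

```typescript
// Hypothetical file record; the worker's real type may carry more fields.
interface UploadedFile {
  name: string;
  mimetype: string;
}

// Type guard that narrows an untyped payload entry to UploadedFile.
function isUploadedFile(value: unknown): value is UploadedFile {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate.name === "string" && typeof candidate.mimetype === "string";
}

class WorkerSketch {
  private readonly rawFiles: unknown;

  constructor(rawFiles: unknown) {
    this.rawFiles = rawFiles;
  }

  // Single place where the untyped payload is validated and typed.
  get uploadedFiles(): UploadedFile[] {
    return Array.isArray(this.rawFiles) ? this.rawFiles.filter(isUploadedFile) : [];
  }

  // Callers no longer need casts: `f` is already an UploadedFile here.
  imageNames(): string[] {
    return this.uploadedFiles
      .filter((f) => f.mimetype.startsWith("image/"))
      .map((f) => f.name);
  }
}
```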
@codex solve these issues
Summary
Testing
Committed and opened a follow-up PR record via |
Description
This PR adds multi-modal prompt support to the OpenClaw worker, enabling vision analysis of uploaded images. When image files are present in user uploads, they are now automatically included in the prompt as image content blocks rather than just being referenced as file paths.
Key Changes
- `ImageContent` and `TextContent` types from `@mariozechner/pi-ai`
- `IMAGE_MIME_TYPES` static set to identify supported image formats (PNG, JPEG, GIF, WebP)
- `getImageContentBlocks()` method to load and convert image files to base64-encoded content blocks
Type of Change
Testing
Checklist
Additional Notes
The implementation gracefully degrades to text-only prompts when no images are present, ensuring no impact on existing functionality. Image files are loaded from the workspace input directory and converted to base64 for inclusion in the multi-modal prompt.
https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4