
Add vision/image analysis support to worker prompts#149

Merged
buremba merged 4 commits into main from claude/github-issues-review-OxzRu
Mar 4, 2026

@buremba buremba commented Mar 4, 2026

Description

This PR adds multi-modal prompt support to the OpenClaw worker, enabling vision analysis of uploaded images. When image files are present in user uploads, they are now automatically included in the prompt as image content blocks rather than just being referenced as file paths.

Key Changes

  • Import ImageContent and TextContent types from @mariozechner/pi-ai
  • Add IMAGE_MIME_TYPES static set to identify supported image formats (PNG, JPEG, GIF, WebP)
  • Implement getImageContentBlocks() method to load and convert image files to base64-encoded content blocks
  • Modify prompt handling to build multi-modal prompts when images are present, combining text and image content
  • Update user files section messaging to distinguish between image files (included directly) and non-image files (accessible via commands)
  • Add logging for image loading operations and vision analysis
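The image-loading step in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `ImageBlock` shape is an assumed stand-in for the `ImageContent` type from @mariozechner/pi-ai (which may differ), and `UploadedFile` and the function signatures are hypothetical.

```typescript
import { readFile } from "node:fs/promises";

// Assumed shape of an image content block; the real ImageContent type
// comes from @mariozechner/pi-ai and may have different fields.
interface ImageBlock {
  type: "image";
  mimeType: string;
  data: string; // base64-encoded file bytes
}

// Hypothetical description of an uploaded file.
interface UploadedFile {
  path: string;
  mimetype: string;
}

// The four formats named in the PR description.
const IMAGE_MIME_TYPES = new Set(["image/png", "image/jpeg", "image/gif", "image/webp"]);

// Pure conversion step: raw bytes in, base64 content block out.
function toImageBlock(bytes: Buffer, mimeType: string): ImageBlock {
  return { type: "image", mimeType, data: bytes.toString("base64") };
}

// Reads every supported image; a failed read is logged and skipped
// so one bad file does not abort the whole prompt.
async function getImageContentBlocks(files: UploadedFile[]): Promise<ImageBlock[]> {
  const blocks: ImageBlock[] = [];
  for (const file of files) {
    if (!IMAGE_MIME_TYPES.has(file.mimetype)) continue;
    try {
      blocks.push(toImageBlock(await readFile(file.path), file.mimetype));
    } catch (err) {
      console.warn(`Skipping unreadable image ${file.path}:`, err);
    }
  }
  return blocks;
}
```

Keeping the byte-to-block conversion separate from the file read makes the encoding step easy to test without touching the filesystem.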

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

  • Existing prompt handling remains unchanged when no images are present
  • Multi-modal prompt path is only triggered when image files are detected
  • Image loading includes error handling with warning logs for failed reads
  • File type detection uses MIME type checking against a predefined set

Checklist

  • Code follows project style guidelines
  • Changes are backward compatible (text-only prompts work as before)
  • Error handling included for image loading failures
  • Logging added for debugging vision analysis operations

Additional Notes

The implementation gracefully degrades to text-only prompts when no images are present, ensuring no impact on existing functionality. Image files are loaded from the workspace input directory and converted to base64 for inclusion in the multi-modal prompt.

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4

claude added 2 commits March 4, 2026 10:11
When users send images via Telegram/WhatsApp, the worker now reads them
as base64 and includes them as ImageContent blocks in the prompt sent
to the LLM, enabling vision-capable models to actually see and analyze
uploaded images instead of just knowing the file path.

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
The installed Pi SDK version uses prompt(text, {images}) not
multi-modal content arrays. Fixed to pass ImageContent[] via
the images option parameter.

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0531ae8081

Comment thread packages/worker/src/openclaw/worker.ts Outdated

// Build multi-modal prompt when images are present
const imageBlocks = await this.getImageContentBlocks();
if (imageBlocks.length > 0) {
P1: Gate image prompts on model vision capability

This branch sends a multimodal payload whenever any uploaded file is classified as an image, but there is no check that the selected provider/model can accept image content. In sessions using text-only models, session.prompt(...) will fail on these image blocks and the whole request errors out, whereas the previous behavior still allowed processing via local file tools. Please gate this path on model capabilities (or catch/retry with text-only prompt) so image uploads don’t hard-fail non-vision models.
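One way to address this is the catch-and-retry option the comment mentions. The sketch below assumes a minimal `PromptSession` interface modeled on the `prompt(text, { images })` call described in the commit message; the interface, the `ImageBlock` shape, and the function name are hypothetical, not the Pi SDK's actual types.

```typescript
// Assumed image block shape; the SDK's ImageContent type may differ.
interface ImageBlock {
  type: "image";
  mimeType: string;
  data: string;
}

// Hypothetical minimal session interface for illustration only.
interface PromptSession {
  prompt(text: string, options?: { images?: ImageBlock[] }): Promise<void>;
}

// Try the multimodal prompt first; if it fails while images are attached,
// log a warning and retry once with text only.
async function promptWithImageFallback(
  session: PromptSession,
  text: string,
  images: ImageBlock[],
): Promise<"multimodal" | "text-only"> {
  if (images.length === 0) {
    await session.prompt(text);
    return "text-only";
  }
  try {
    await session.prompt(text, { images });
    return "multimodal";
  } catch (err) {
    console.warn("Prompt with images failed; retrying with text only:", err);
    await session.prompt(text);
    return "text-only";
  }
}
```

A blanket retry also re-sends the prompt after failures unrelated to images (for example a network error), so a real implementation would probably inspect the error or check a model capability flag before retrying.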

Comment thread packages/worker/src/openclaw/worker.ts Outdated
Comment on lines +1017 to +1018
const data = await fs.readFile(filePath);
const base64Data = data.toString("base64");
P2: Bound image file size before base64 encoding

Each image is loaded fully into memory and then expanded to base64 with no per-file or total-size cap. Large uploads (or multiple medium images) can create very large in-memory payloads and oversized model requests, causing OOM/restarts or upstream request rejection. Add a size budget (and fallback behavior) before calling readFile/toString("base64") to keep worker memory and prompt size bounded.
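A size budget of this kind could be sketched as below. Base64 emits 4 output characters for every 3 input bytes (with padding), so the encoded size can be computed from the file size alone, before anything is read into memory. The `Candidate` type, the function names, and the limits passed in are hypothetical; sizes would come from something like `fs.stat`.

```typescript
// Exact length of the padded base64 encoding of `byteLength` bytes.
const base64Length = (byteLength: number): number => 4 * Math.ceil(byteLength / 3);

// Hypothetical candidate record; `size` is the on-disk size in bytes.
interface Candidate {
  path: string;
  size: number;
}

interface BudgetResult {
  accepted: Candidate[];
  skipped: Candidate[];
  encodedBytes: number; // total base64 size of the accepted files
}

// Accept files in order while both the per-file cap and the total
// encoded-size budget hold; everything else is skipped, not read.
function selectWithinBudget(
  candidates: Candidate[],
  maxFileBytes: number,
  maxTotalEncodedBytes: number,
): BudgetResult {
  const accepted: Candidate[] = [];
  const skipped: Candidate[] = [];
  let encodedBytes = 0;
  for (const candidate of candidates) {
    const encoded = base64Length(candidate.size);
    if (candidate.size > maxFileBytes || encodedBytes + encoded > maxTotalEncodedBytes) {
      skipped.push(candidate);
      continue;
    }
    encodedBytes += encoded;
    accepted.push(candidate);
  }
  return { accepted, skipped, encodedBytes };
}
```

Skipped files can still be listed in the prompt as file paths, which matches the fallback behavior the comment asks for.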

claude added 2 commits March 4, 2026 10:52
- Use mimetype.startsWith("image/") instead of hardcoded MIME set
- Pi SDK handles provider-specific vision support transparently
- Works for all providers and all platforms (Telegram, WhatsApp, Slack)
- Consolidate and clean up instructions

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
- Extract shared `uploadedFiles` getter and `isImage()` helper
- Remove duplicated file access across 3 methods
- Remove unnecessary conditional branch (always pass images option)
- Add 20MB size guard to skip oversized images
- Rename to `loadImageAttachments()` for clarity
- Remove `(f: any)` casts where typed getter is used
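The prefix-based MIME check and the size guard from these two commits might look roughly like this. It is a sketch under assumptions: the `UploadedFile` fields and `shouldAttach` are hypothetical, and the commit does not say whether "20MB" means 20 MiB or 20,000,000 bytes, so the constant below is a guess.

```typescript
// Assumed reading of the "20MB" guard as 20 MiB.
const MAX_IMAGE_BYTES = 20 * 1024 * 1024;

// Hypothetical upload record; field names are illustrative.
interface UploadedFile {
  name: string;
  mimetype?: string;
  size: number; // bytes
}

// Prefix check instead of a hardcoded set: any image/* type counts.
function isImage(file: Pick<UploadedFile, "mimetype">): boolean {
  return file.mimetype?.startsWith("image/") ?? false;
}

// An upload is attached to the prompt only if it is an image and
// does not exceed the size guard; oversized images are skipped.
function shouldAttach(file: UploadedFile): boolean {
  return isImage(file) && file.size <= MAX_IMAGE_BYTES;
}
```

The prefix check is broader than the earlier fixed set: it also accepts types such as `image/svg+xml`, which a given vision model may not support.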

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
buremba commented Mar 4, 2026

@codex solve these issues

@chatgpt-codex-connector

Summary

  • Added a safe fallback for multimodal prompting: the worker now first tries session.prompt(..., { images }), and if that fails with image attachments present (e.g., text-only model), it logs a warning and retries with a text-only prompt so requests don’t hard-fail. packages/worker/src/openclaw/worker.ts L762-L781
  • Updated the user file hint text to clarify that direct image inclusion depends on model vision support and attachment size constraints. packages/worker/src/openclaw/worker.ts L988-L999
  • Hardened image loading with pre-read size checks and a total prompt image budget:

Testing

  • ⚠️ curl -I 'https://duckduckgo.com/?q=base64+size+overhead+multimodal+capability+gating' | head (blocked by network proxy: HTTP 403 CONNECT tunnel failure)
  • make dev (failed: docker binary not available in this environment)
  • make clean-workers (completed; reports docker not found during cleanup path but exits successfully)
  • ./scripts/test-bot.sh "@me test prompt" (failed: TEST_PLATFORM not configured)
  • make build-packages (failed due to missing local TS/Bun test typing dependencies, e.g. bun:test / Node types)

Committed and opened a follow-up PR record via make_pr with title: “Follow-up: guard multimodal image prompts and cap attachment size”.

@buremba buremba merged commit f6eb4c4 into main Mar 4, 2026
8 of 9 checks passed
@buremba buremba deleted the claude/github-issues-review-OxzRu branch April 21, 2026 21:41
buremba added a commit that referenced this pull request May 17, 2026
Includes:
- #150 PATH-based agent detection (replaces MCP session inference)
- #152 auto-pick default agent when defaultAgent is unset
- #155 distinct SF Symbol icons per CLI agent kind
- #154 WATCHERS section in menubar with last-fired + manual trigger