Add vision/image analysis support to worker prompts by buremba · Pull Request #149 · lobu-ai/lobu

buremba · 2026-03-04T10:45:42Z

Description

This PR adds multi-modal prompt support to the OpenClaw worker, enabling vision analysis of uploaded images. When image files are present in user uploads, they are now automatically included in the prompt as image content blocks rather than just being referenced as file paths.

Key Changes

- Import ImageContent and TextContent types from @mariozechner/pi-ai
- Add IMAGE_MIME_TYPES static set to identify supported image formats (PNG, JPEG, GIF, WebP)
- Implement getImageContentBlocks() method to load and convert image files to base64-encoded content blocks
- Modify prompt handling to build multi-modal prompts when images are present, combining text and image content
- Update user files section messaging to distinguish between image files (included directly) and non-image files (accessible via commands)
- Add logging for image loading operations and vision analysis
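For readers who want a concrete picture of the loading step listed above, here is a minimal sketch, not the PR's actual code. The `ImageBlock` interface is a local stand-in for the `ImageContent` type (its exact shape is an assumption), and `UploadedFile` and `loadImageBlocks` are hypothetical names introduced only for illustration.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Local stand-in for ImageContent from @mariozechner/pi-ai (shape is assumed).
interface ImageBlock {
  type: "image";
  data: string; // base64-encoded file bytes
  mimeType: string;
}

// Hypothetical description of one uploaded file.
interface UploadedFile {
  name: string;
  mimetype: string;
}

// Formats named in the PR description: PNG, JPEG, GIF, WebP.
const IMAGE_MIME_TYPES = new Set([
  "image/png",
  "image/jpeg",
  "image/gif",
  "image/webp",
]);

// Reads each supported image from inputDir and converts it to a base64 block.
// Unreadable files are skipped with a warning instead of failing the prompt.
async function loadImageBlocks(
  inputDir: string,
  files: UploadedFile[],
): Promise<ImageBlock[]> {
  const blocks: ImageBlock[] = [];
  for (const file of files) {
    if (!IMAGE_MIME_TYPES.has(file.mimetype)) continue;
    try {
      const data = await fs.readFile(path.join(inputDir, file.name));
      blocks.push({
        type: "image",
        data: data.toString("base64"),
        mimeType: file.mimetype,
      });
    } catch (err) {
      console.warn(`Skipping unreadable image ${file.name}:`, err);
    }
  }
  return blocks;
}
```

Non-image files never reach `readFile` in this sketch, which matches the description's split between images included directly and other files left for command-based access.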

Type of Change

New feature (non-breaking change which adds functionality)

Testing

- Existing prompt handling remains unchanged when no images are present
- Multi-modal prompt path is only triggered when image files are detected
- Image loading includes error handling with warning logs for failed reads
- File type detection uses MIME type checking against a predefined set

Checklist

- Code follows project style guidelines
- Changes are backward compatible (text-only prompts work as before)
- Error handling included for image loading failures
- Logging added for debugging vision analysis operations

Additional Notes

The implementation gracefully degrades to text-only prompts when no images are present, ensuring no impact on existing functionality. Image files are loaded from the workspace input directory and converted to base64 for inclusion in the multi-modal prompt.

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4

When users send images via Telegram/WhatsApp, the worker now reads them as base64 and includes them as ImageContent blocks in the prompt sent to the LLM, enabling vision-capable models to actually see and analyze uploaded images instead of just knowing the file path. https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4

The installed Pi SDK version uses prompt(text, {images}) not multi-modal content arrays. Fixed to pass ImageContent[] via the images option parameter. https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0531ae8081


chatgpt-codex-connector · 2026-03-04T10:50:34Z

+
+      // Build multi-modal prompt when images are present
+      const imageBlocks = await this.getImageContentBlocks();
+      if (imageBlocks.length > 0) {


Gate image prompts on model vision capability

This branch sends a multimodal payload whenever any uploaded file is classified as an image, but there is no check that the selected provider/model can accept image content. In sessions using text-only models, session.prompt(...) will fail on these image blocks and the whole request errors out, whereas the previous behavior still allowed processing via local file tools. Please gate this path on model capabilities (or catch/retry with text-only prompt) so image uploads don’t hard-fail non-vision models.
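One way to read the capability gate suggested above is sketched below. It is illustrative only: `ModelInfo` and its `input` field are assumptions about how a model's accepted input kinds might be described, and `imagesForModel` is a hypothetical helper, not part of the PR or of any SDK.

```typescript
// Local stand-in for an image content block (shape is assumed).
interface ImageBlock {
  type: "image";
  data: string;
  mimeType: string;
}

// Hypothetical model descriptor; the `input` field listing accepted
// input kinds is an assumption made for this sketch.
interface ModelInfo {
  id: string;
  input: ReadonlyArray<"text" | "image">;
}

// Returns the images only when the model declares image input support,
// so text-only models receive a plain text prompt.
function imagesForModel(model: ModelInfo, images: ImageBlock[]): ImageBlock[] {
  return model.input.includes("image") ? images : [];
}
```

With a gate like this, an upload sent to a text-only model would fall back to the earlier behavior, where the file stays reachable through local file tools.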


chatgpt-codex-connector · 2026-03-04T10:50:35Z

+        const data = await fs.readFile(filePath);
+        const base64Data = data.toString("base64");


Bound image file size before base64 encoding

Each image is loaded fully into memory and then expanded to base64 with no per-file or total-size cap. Large uploads (or multiple medium images) can create very large in-memory payloads and oversized model requests, causing OOM/restarts or upstream request rejection. Add a size budget (and fallback behavior) before calling readFile/toString("base64") to keep worker memory and prompt size bounded.
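The size budget requested above can be sketched as a pure selection step that runs before any file is read. The function names and limits are hypothetical; in a real worker the byte sizes would presumably come from `fs.stat`. The sketch relies on the fact that base64 encodes every 3 bytes as 4 characters, with padding.

```typescript
// A file that could be embedded, with its size known up front (e.g. from fs.stat).
interface Candidate {
  path: string;
  sizeBytes: number;
}

// Length of the padded base64 encoding of byteCount bytes.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Accepts files in order while both limits hold; everything else is skipped.
// maxFileBytes caps each raw file, maxTotalBase64Chars caps the combined
// encoded payload. Both limits are parameters, not values from the PR.
function selectWithinBudget(
  candidates: Candidate[],
  maxFileBytes: number,
  maxTotalBase64Chars: number,
): { accepted: string[]; skipped: string[] } {
  const accepted: string[] = [];
  const skipped: string[] = [];
  let total = 0;
  for (const c of candidates) {
    const encoded = base64Length(c.sizeBytes);
    if (c.sizeBytes > maxFileBytes || total + encoded > maxTotalBase64Chars) {
      skipped.push(c.path);
      continue;
    }
    total += encoded;
    accepted.push(c.path);
  }
  return { accepted, skipped };
}
```

Because the decision uses only file sizes, an oversized upload is rejected without ever being loaded into memory.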


- Use mimetype.startsWith("image/") instead of hardcoded MIME set
- Pi SDK handles provider-specific vision support transparently
- Works for all providers and all platforms (Telegram, WhatsApp, Slack)
- Consolidate and clean up instructions

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4

- Extract shared `uploadedFiles` getter and `isImage()` helper
- Remove duplicated file access across 3 methods
- Remove unnecessary conditional branch (always pass images option)
- Add 20MB size guard to skip oversized images
- Rename to `loadImageAttachments()` for clarity
- Remove `(f: any)` casts where typed getter is used

https://claude.ai/code/session_01JE1v44vrqYVTbrmkA25CR4
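A rough sketch of what a shared `uploadedFiles` getter and `isImage()` helper could look like is shown below. The class name `UploadContext`, the `size` field, and the `imageFiles` getter are assumptions made for illustration; only the `startsWith("image/")` check and the 20MB guard come from the commit messages.

```typescript
// Hypothetical shape of an uploaded file record.
interface UploadedFile {
  name: string;
  mimetype?: string;
  size?: number; // bytes; assumed to be available on the record
}

// 20MB guard mentioned in the commit message.
const MAX_IMAGE_BYTES = 20 * 1024 * 1024;

class UploadContext {
  private readonly files: UploadedFile[];

  constructor(files: UploadedFile[] | undefined) {
    this.files = files ?? [];
  }

  // Single typed access point, so callers need no `(f: any)` casts.
  get uploadedFiles(): UploadedFile[] {
    return this.files;
  }

  // Prefix check instead of a hardcoded MIME set.
  static isImage(file: UploadedFile): boolean {
    return file.mimetype?.startsWith("image/") ?? false;
  }

  // Images small enough to embed; oversized images are skipped.
  get imageFiles(): UploadedFile[] {
    return this.uploadedFiles.filter(
      (f) => UploadContext.isImage(f) && (f.size ?? 0) <= MAX_IMAGE_BYTES,
    );
  }
}
```

The prefix check accepts any `image/*` type, so formats outside the original four-entry set pass through and it is left to the downstream SDK or provider to accept or reject them.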

buremba · 2026-03-04T19:50:14Z

@codex solve these issues

chatgpt-codex-connector · 2026-03-04T19:52:44Z

Summary

Added a safe fallback for multimodal prompting: the worker now first tries session.prompt(..., { images }), and if that fails with image attachments present (e.g., text-only model), it logs a warning and retries with a text-only prompt so requests don’t hard-fail. packages/worker/src/openclaw/worker.ts L762-L781
Updated the user file hint text to clarify that direct image inclusion depends on model vision support and attachment size constraints. packages/worker/src/openclaw/worker.ts L988-L999
Hardened image loading with pre-read size checks and a total prompt image budget:
- per-file check via fs.stat before readFile,
- total embedded size cap across all images,
- skip + warning behavior when limits are exceeded.
packages/worker/src/openclaw/worker.ts L1021-L1077
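The try-then-retry fallback summarized above might look roughly like the following. `PromptSession` is a hypothetical minimal interface modeled on the `prompt(text, {images})` call shape mentioned earlier in this PR, and `promptWithFallback` is an illustrative name, not the function in worker.ts.

```typescript
// Local stand-in for an image content block (shape is assumed).
interface ImageBlock {
  type: "image";
  data: string;
  mimeType: string;
}

// Hypothetical minimal session interface; a real session has more methods.
interface PromptSession {
  prompt(text: string, options?: { images?: ImageBlock[] }): Promise<void>;
}

// Tries the multimodal prompt first; if it fails while images are attached,
// logs a warning and retries once with text only.
async function promptWithFallback(
  session: PromptSession,
  text: string,
  images: ImageBlock[],
): Promise<"multimodal" | "text-only"> {
  if (images.length === 0) {
    await session.prompt(text);
    return "text-only";
  }
  try {
    await session.prompt(text, { images });
    return "multimodal";
  } catch (err) {
    console.warn("Prompt with images failed; retrying text-only:", err);
    await session.prompt(text);
    return "text-only";
  }
}
```

Note that a blanket catch like this also retries on failures unrelated to images, such as network errors, so a production version may want to inspect the error before retrying.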

Testing

⚠️ curl -I 'https://duckduckgo.com/?q=base64+size+overhead+multimodal+capability+gating' | head (blocked by network proxy: HTTP 403 CONNECT tunnel failure)
❌ make dev (failed: docker binary not available in this environment)
✅ make clean-workers (completed; reports docker not found during cleanup path but exits successfully)
❌ ./scripts/test-bot.sh "@me test prompt" (failed: TEST_PLATFORM not configured)
❌ make build-packages (failed due to missing local TS/Bun test typing dependencies, e.g. bun:test / Node types)

Committed and opened a follow-up PR record via make_pr with title: “Follow-up: guard multimodal image prompts and cap attachment size”.


Includes:
- #150 PATH-based agent detection (replaces MCP session inference)
- #152 auto-pick default agent when defaultAgent is unset
- #155 distinct SF Symbol icons per CLI agent kind
- #154 WATCHERS section in menubar with last-fired + manual trigger

claude added 2 commits March 4, 2026 10:11

fix: use PromptOptions.images for vision input (matches Pi SDK API)

2789802


chatgpt-codex-connector Bot reviewed Mar 4, 2026


claude added 2 commits March 4, 2026 10:52

buremba merged commit f6eb4c4 into main Mar 4, 2026
8 of 9 checks passed

buremba deleted the claude/github-issues-review-OxzRu branch April 21, 2026 21:41

buremba mentioned this pull request May 17, 2026

chore: bump owletto submodule (#149 → #154) #825

Merged

3 tasks

buremba merged 4 commits into main from claude/github-issues-review-OxzRu

2 participants
