Feat: Add image extraction for webpages (media parsing) #818
base: develop
Conversation
Walkthrough
Adds a media extraction pipeline: a new server-side mediaParser.extractImages(url), plus an in-page scraper that extracts image/PDF text (alt/title, OCR, pdf.js) and posts the result to the parent window, which relays it to the server and appends it to the workflow recording (see the sequence diagram below).
Sequence Diagram(s)

sequenceDiagram
autonumber
participant U as User (click)
participant B as In-page Scraper
participant P as Parent Window (iframe)
participant R as DOMBrowserRenderer
participant S as Socket / Server
participant H as Input Handler
participant G as WorkflowGenerator
U->>B: click element (img / iframe / object)
alt image
B->>B: resolve url, read alt/title or OCR (tesseract)
else pdf/frame
B->>B: fetch/read PDF text (pdfjs)
end
B->>P: postMessage {type: "maxun:media-extracted", url, tag, selector, extractedText}
P->>R: message received (origin checked)
R->>S: socket.emit "dom:media-extracted" {url, tag, selector, extractedText}
S->>H: receive dom:media-extracted
H->>G: call handleMediaExtracted(data, page)
G->>G: append media event to workflowRecord.events (timestamp)
G-->>S: emit "workflow:media-added" (best-effort)
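Read alongside the diagram, the browser-side handoff can be sketched roughly as follows. Only the maxun:media-extracted message type, the dom:media-extracted socket event, and the payload fields come from the diagram; expectedIframeOrigin, socket, and everything else here are illustrative assumptions, not the PR's exact code.

// In-page scraper (runs inside the recorded page): after extracting text
// from a clicked image/PDF, hand the result to the hosting iframe's parent.
function reportMediaExtracted({ url, tag, selector, extractedText }) {
  window.parent.postMessage(
    { type: 'maxun:media-extracted', url, tag, selector, extractedText },
    '*' // the real renderer checks event.origin on the receiving side
  );
}

// Parent window (DOMBrowserRenderer side): verify origin, then relay to the server.
window.addEventListener('message', (event) => {
  if (event.origin !== expectedIframeOrigin) return; // origin check per diagram
  const msg = event.data;
  if (msg && msg.type === 'maxun:media-extracted') {
    socket.emit('dom:media-extracted', {
      url: msg.url,
      tag: msg.tag,
      selector: msg.selector,
      extractedText: msg.extractedText,
    });
  }
});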
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Worked on Issue #164 to add image extraction for webpages. Let me know if any improvements are needed!
Actionable comments posted: 2
🧹 Nitpick comments (6)
mediaParser.js (6)
1-3: Consider module system consistency. The code uses CommonJS (require) syntax. If the rest of the codebase uses ES6 modules (import/export), consider updating for consistency.

-const axios = require('axios');
-const cheerio = require('cheerio');
-const { URL } = require('url');
+import axios from 'axios';
+import * as cheerio from 'cheerio';
+import { URL } from 'url';
5-6: Add input validation for the URL parameter. The function does not validate the url parameter before using it. Invalid or malicious input could cause unexpected behavior.

 async function extractImages(url) {
+  if (!url || typeof url !== 'string') {
+    throw new TypeError('URL must be a non-empty string');
+  }
   try {
40-43: Improve error handling and logging. Current issues:
- console.log is not appropriate for production error logging. Use a proper logging framework (e.g., winston, which is already a dependency).
- Returning an empty array silently swallows errors, making debugging difficult for callers. Consider re-throwing the error or returning a Result/Either type.

If you want to maintain the current behavior of returning an empty array, at least use proper logging:

 } catch (error) {
-  console.log('Oops! Something went wrong while fetching images:', error.message);
+  // Assuming winston logger is available
+  logger.error('Failed to extract images from URL', { url, error: error.message });
   return [];
 }

Alternatively, consider letting the error propagate to the caller:

 } catch (error) {
-  console.log('Oops! Something went wrong while fetching images:', error.message);
-  return [];
+  throw new Error(`Failed to extract images from ${url}: ${error.message}`);
 }
7-8: Configure axios with timeout and headers for reliability. The current implementation lacks:
- Timeout: Requests could hang indefinitely
- User-Agent: Some websites block requests without a proper User-Agent header
- Response size limits: Large responses could cause memory issues
- Redirect handling: Axios follows redirects by default, but this should be explicit
-  // 1. Fetch HTML
-  const { data: html } = await axios.get(url);
+  // 1. Fetch HTML with proper configuration
+  const { data: html } = await axios.get(url, {
+    timeout: 10000, // 10 second timeout
+    maxContentLength: 10 * 1024 * 1024, // 10MB limit
+    maxBodyLength: 10 * 1024 * 1024,
+    headers: {
+      'User-Agent': 'Mozilla/5.0 (compatible; MaxunBot/1.0; +https://maxun.dev)'
+    },
+    maxRedirects: 5
+  });
10-11: LGTM with optional enhancement. The cheerio loading is correct. Optionally, you could add parsing options for better HTML handling:

-  // 2. Load HTML into cheerio
-  const $ = cheerio.load(html);
+  // 2. Load HTML into cheerio with lenient parsing
+  const $ = cheerio.load(html, {
+    decodeEntities: true,
+    normalizeWhitespace: false
+  });
21-28: Good URL resolution logic. The URL resolution correctly handles relative paths and silently skips invalid URLs.

Consider also extracting URLs from the srcset attribute for responsive images:

 $('img').each((index, element) => {
   let src = $(element).attr('src');
+  const srcset = $(element).attr('srcset');
   const alt = $(element).attr('alt') || '';
   if (src) {
     // ... existing src handling ...
   }
+
+  // Handle srcset for responsive images
+  if (srcset) {
+    const srcsetUrls = srcset.split(',').map(s => s.trim().split(/\s+/)[0]);
+    for (const srcsetUrl of srcsetUrls) {
+      try {
+        const absoluteUrl = new URL(srcsetUrl, url).href;
+        if (!seen.has(absoluteUrl)) {
+          seen.add(absoluteUrl);
+          images.push({ url: absoluteUrl, altText: alt });
+        }
+      } catch {
+        // skip invalid URLs
+      }
+    }
+  }
 });
📒 Files selected for processing (2)
mediaParser.js (1 hunks)
package.json (2 hunks)
🔇 Additional comments (5)
mediaParser.js (3)
13-14: LGTM! Appropriate data structures: array for results and Set for efficient deduplication.
16-20: LGTM! The loop structure and attribute extraction are correct. Good handling of missing alt text with a default empty string.
30-35: LGTM! Deduplication logic is correct and efficient. The result object structure { url, altText } matches the PR requirements.
package.json (2)
31-31: cheerio version verified. Cheerio ^1.1.2 is available on npm.
27-27: Axios ^1.12.2 exists and addresses all known vulnerabilities.
"styled-components": "^5.3.3", | ||
"swagger-jsdoc": "^6.2.8", | ||
"swagger-ui-express": "^5.0.1", | ||
"tesseract.js": "^6.0.1", |
Unused dependency: tesseract.js is not referenced in the current implementation.
The tesseract.js library was added but is not imported or used in mediaParser.js. Consider removing it unless it's planned for future use (e.g., OCR on images).
If tesseract.js is intended for future functionality, consider adding a comment in the code or creating a follow-up issue to track its implementation.
🤖 Prompt for AI Agents
In package.json around line 84, the dependency "tesseract.js": "^6.0.1" is unused in the codebase (not imported in mediaParser.js); remove the tesseract.js entry from package.json and run npm/yarn install to update the lockfile, or if it is intended for future OCR work, leave the dependency but add a short TODO comment in the relevant module or create a follow-up issue referencing this dependency and its planned usage so it is tracked.
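If the dependency is kept for future OCR work, usage might look roughly like this. A sketch assuming tesseract.js's top-level recognize API; ocrImage is a hypothetical helper, not code from this PR.

import Tesseract from 'tesseract.js';

// Hypothetical helper showing the intended OCR use of the dependency:
// fetch an image by URL and return the recognized English text.
async function ocrImage(imageUrl) {
  const { data } = await Tesseract.recognize(imageUrl, 'eng');
  return data.text.trim();
}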
…better error handling
Actionable comments posted: 0
🧹 Nitpick comments (2)
mediaParser.js (2)
12-22: Consider validating Content-Type before parsing. The fetch configuration is solid, but the function doesn't verify that the response is HTML. Parsing non-HTML content (JSON, XML, binary data) could lead to unexpected results or waste processing.

Add a Content-Type check after the axios call:

 const { data: html } = await axios.get(url, {
   timeout: 10000,
   maxContentLength: 10 * 1024 * 1024,
   maxBodyLength: 10 * 1024 * 1024,
   headers: {
     'User-Agent': 'Mozilla/5.0 (compatible; MaxunBot/1.0; +https://maxun.dev)'
   },
   maxRedirects: 5
 });
+
+// Validate Content-Type
+const contentType = response.headers['content-type'] || '';
+if (!contentType.includes('text/html')) {
+  throw new Error(`Expected HTML but got ${contentType}`);
+}

Note: You'll need to capture the full response object instead of destructuring just data:

-const { data: html } = await axios.get(url, {
+const response = await axios.get(url, {
   timeout: 10000,
   maxContentLength: 10 * 1024 * 1024,
   maxBodyLength: 10 * 1024 * 1024,
   headers: {
     'User-Agent': 'Mozilla/5.0 (compatible; MaxunBot/1.0; +https://maxun.dev)'
   },
   maxRedirects: 5
 });
+
+const contentType = response.headers['content-type'] || '';
+if (!contentType.includes('text/html')) {
+  throw new Error(`Expected HTML but got ${contentType}`);
+}
+
+const html = response.data;
50-66: Consider filtering empty srcset URLs. The srcset parsing logic is mostly correct, but malformed srcset attributes (e.g., trailing commas) could result in empty strings that get passed to new URL(), which would resolve to the base URL.

Filter out empty URLs before processing:

 const srcset = $(element).attr('srcset');
 if (srcset) {
-  const srcsetUrls = srcset.split(',').map(s => s.trim().split(/\s+/)[0]);
+  const srcsetUrls = srcset.split(',')
+    .map(s => s.trim().split(/\s+/)[0])
+    .filter(Boolean); // Remove empty strings
   for (const srcsetUrl of srcsetUrls) {
     try {
       const absoluteUrl = new URL(srcsetUrl, url).href;
       if (!seen.has(absoluteUrl)) {
         seen.add(absoluteUrl);
         images.push({ url: absoluteUrl, altText: alt });
       }
     } catch {
       logger.warn(`Invalid srcset URL: ${srcsetUrl}`);
     }
   }
 }
📒 Files selected for processing (1)
mediaParser.js (1 hunks)
🔇 Additional comments (4)
mediaParser.js (4)
6-10: LGTM: Input validation is appropriate. The function correctly validates that the URL is a non-empty string and throws a TypeError for invalid input.
24-28: LGTM: Cheerio configuration is appropriate. The cheerio load options correctly decode HTML entities (useful for alt text) while preserving original whitespace.
30-48: LGTM: src handling is correct. The function properly:
- Extracts the src attribute and alt text
- Resolves relative URLs to absolute URLs
- Deduplicates images using a Set
- Handles invalid URLs gracefully with logging

A reconstruction of that loop is sketched below.
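Pieced together from the review's description rather than the PR diff; seen, images, url, and logger are the surrounding function's variables, and this is not the PR's exact code.

// Reconstruction of the src-handling loop described above: resolve each
// <img> src against the page URL, keep alt text, dedupe via the shared Set.
$('img').each((i, element) => {
  const src = $(element).attr('src');
  const alt = $(element).attr('alt') || '';
  if (!src) return;
  try {
    const absoluteUrl = new URL(src, url).href; // relative -> absolute
    if (!seen.has(absoluteUrl)) {
      seen.add(absoluteUrl);
      images.push({ url: absoluteUrl, altText: alt });
    }
  } catch {
    logger.warn(`Invalid image URL: ${src}`); // log and continue
  }
});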
68-73: LGTM: Error handling is appropriate. The function correctly logs errors with context and throws a descriptive error to propagate failures upstream, which is better than silently returning an empty array.
- Switch to ES6 module syntax (import/export)
- Add URL input validation
- Configure axios for timeout, headers, content limits
- Validate Content-Type before parsing HTML
- Handle srcset with empty URL filtering
- Remove duplicate images
- Use Winston logger for warnings and errors
- Add human-friendly JSDoc comments
Actionable comments posted: 0
♻️ Duplicate comments (1)
mediaParser.js (1)
101-101: Critical: Missing module export. The extractImages function is still not exported, making it unusable by other modules. This issue was flagged in a previous review and remains unresolved. Add an export statement at the end of the file:

   }
 }
+
+export { extractImages };

Or for CommonJS:

   }
 }
+
+module.exports = { extractImages };
🧹 Nitpick comments (2)
mediaParser.js (2)
52-92: Consider handling edge cases. The image extraction logic is solid with proper deduplication and URL resolution. However, consider these optional refinements:

- Data URLs: The current implementation includes data: URLs, which may bloat the results. Consider filtering them if not needed:

  if (absoluteUrl.startsWith('data:')) continue;

- <picture> elements: Modern responsive images often use <picture> with <source> tags. Consider extracting from <source srcset="..."> within <picture> elements for completeness.
94-100: Preserve error stack trace. The error handling logs and rethrows correctly, but the original error's stack trace is lost when creating a new Error. Consider preserving it for better debugging:

-throw new Error(`Failed to extract images from ${url}: ${error.message}`);
+error.message = `Failed to extract images from ${url}: ${error.message}`;
+throw error;

Or use the newer cause option (Node 16.9+):

-throw new Error(`Failed to extract images from ${url}: ${error.message}`);
+throw new Error(`Failed to extract images from ${url}`, { cause: error });
📒 Files selected for processing (1)
mediaParser.js (1 hunks)
🔇 Additional comments (5)
mediaParser.js (5)
1-4: LGTM! Imports are correct. Cheerio is properly imported as a namespace (required for calling cheerio.load()), and axios is at the latest stable version. Based on learnings.
6-19: LGTM! JSDoc is comprehensive and accurately describes the function's behavior, parameters, return type, and error conditions.
20-24: LGTM! Input validation correctly checks for non-empty string and throws the appropriate TypeError.
26-42: LGTM with a consideration. The axios configuration is appropriate for web scraping with reasonable limits and a custom User-Agent. The Content-Type validation ensures HTML is received.
One consideration: The 10-second timeout may be short for slow-loading pages. If this becomes an issue in production, consider making it configurable or increasing it, as sketched below.
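A minimal way to make it configurable (a sketch; the options parameter is an assumption, not part of the PR):

import axios from 'axios';

// Let callers override the timeout instead of hard-coding 10 seconds.
export async function extractImages(url, { timeout = 10_000 } = {}) {
  const response = await axios.get(url, {
    timeout,                             // configurable per call
    maxContentLength: 10 * 1024 * 1024,  // keep the existing 10MB cap
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MaxunBot/1.0)' },
    maxRedirects: 5,
  });
  return response.data; // parsing continues as in the PR
}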
44-50: LGTM! Cheerio is loaded correctly with appropriate options. The decodeEntities: true option ensures HTML entities are decoded properly, which is important for alt text extraction. Based on learnings.
- Export extractImages function for module usage
- Preserve original error stack trace using 'cause'
- Validate Content-Type before parsing HTML
- Filter out data: URLs to avoid bloating results
- Handle <picture> <source> tags for responsive images
- Deduplicate images
- Maintain Winston logging for errors and warnings
- Human-friendly JSDoc comments and ES6 syntax
Actionable comments posted: 0
🧹 Nitpick comments (2)
mediaParser.js (2)
27-37: Consider adding explicit status code validation. The axios configuration is solid with appropriate timeout, size limits, and User-Agent. However, axios will throw for non-2xx status codes by default, which may not provide the clearest error messages.

Consider adding explicit status validation for better error handling:

 const response = await axios.get(url, {
   timeout: 10000,
   maxContentLength: 10 * 1024 * 1024,
   maxBodyLength: 10 * 1024 * 1024,
   headers: {
     'User-Agent': 'Mozilla/5.0 (compatible; MaxunBot/1.0; +https://maxun.dev)'
   },
-  maxRedirects: 5
+  maxRedirects: 5,
+  validateStatus: (status) => status >= 200 && status < 300
 });

Alternatively, you could catch and re-throw with a clearer message for HTTP errors:

 try {
   const response = await axios.get(url, { /* config */ });
 } catch (axiosError) {
   if (axios.isAxiosError(axiosError) && axiosError.response) {
     throw new Error(`HTTP ${axiosError.response.status}: Failed to fetch ${url}`);
   }
   throw axiosError;
 }
92-112: Consider linking source URLs to their picture's img alt text. The extraction logic for <picture> elements works correctly, but source URLs are assigned an empty altText. In HTML, the <img> inside <picture> contains the alt text that applies to all sources.

This is acceptable given the complexity of associating sources with their corresponding img, but if you want more complete alt text coverage, consider this enhancement:

 // Extract <picture> elements with their sources and img
 $('picture').each((i, pictureElement) => {
   const $picture = $(pictureElement);
   const alt = $picture.find('img').attr('alt') || '';
   $picture.find('source').each((j, sourceElement) => {
     const srcset = $(sourceElement).attr('srcset');
     if (srcset) {
       const srcsetUrls = srcset.split(',')
         .map(s => s.trim().split(/\s+/)[0])
         .filter(Boolean);
       for (const srcsetUrl of srcsetUrls) {
         try {
           const absoluteUrl = new URL(srcsetUrl, url).href;
           if (!seen.has(absoluteUrl) && !absoluteUrl.startsWith('data:')) {
             seen.add(absoluteUrl);
             images.push({ url: absoluteUrl, altText: alt });
           }
         } catch {
           logger.warn(`Invalid srcset URL in <source>: ${srcsetUrl}`);
         }
       }
     }
   });
 });

This would apply the picture's img alt text to all source URLs.
📒 Files selected for processing (1)
mediaParser.js (1 hunks)
🔇 Additional comments (8)
mediaParser.js (8)
3-6: LGTM with minor note on URL import. The imports are correctly structured. Note that in Node.js v10+, URL is available as a global, so the import from 'url' is valid but technically optional. Keeping it explicit is fine for clarity and compatibility.
8-21: Excellent documentation. The JSDoc comment clearly describes the function's behavior, parameters, return type, and error conditions. This will help maintainers and users of this module.
22-25: LGTM! Input validation correctly checks for non-empty string and throws the appropriate TypeError for invalid input.
39-43: LGTM! Content-type validation correctly ensures HTML content before parsing. The default to empty string handles missing headers gracefully, and .includes() properly matches content-type with charset parameters.
45-49: LGTM! Cheerio is loaded with appropriate options. Decoding entities is necessary for proper HTML parsing, and preserving whitespace ensures alt text accuracy.
54-90: LGTM! Robust image extraction logic. The extraction logic correctly:
- Handles both src and srcset attributes
- Converts relative URLs to absolute using the URL constructor
- Deduplicates images using a Set
- Filters out data URLs
- Preserves alt text across srcset variants
- Logs invalid URLs without crashing

The srcset parsing correctly splits by comma and extracts the URL portion (before width/pixel descriptors); a small demo of that parsing follows.
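As a standalone illustration of that parsing (the sample srcset value here is made up):

// Demo of the srcset parsing described above: split on commas,
// take the URL before each width/pixel descriptor, drop empties.
const srcset = 'hero-480.jpg 480w, hero-800.jpg 800w, hero-2x.jpg 2x,';

const urls = srcset.split(',')
  .map(s => s.trim().split(/\s+/)[0])
  .filter(Boolean); // the trailing comma yields an empty entry

console.log(urls); // ['hero-480.jpg', 'hero-800.jpg', 'hero-2x.jpg']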
114-119: LGTM! Excellent error handling. The error handling correctly wraps the original error with context (the URL) while preserving the stack trace using the cause property. This is a modern best practice for error chaining.
122-123: LGTM! Export issue resolved. The function is now properly exported using ES6 named export syntax, addressing the critical issue from the previous review. The module is now usable by other modules.
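For reference, consuming the exported function might look like this. A sketch; the { url, altText } result shape comes from earlier comments in this review.

// In an ES module: fetch all image URLs and alt text from a page.
import { extractImages } from './mediaParser.js';

const images = await extractImages('https://example.com');
for (const { url, altText } of images) {
  console.log(url, '-', altText || '(no alt text)');
}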
Thank you for the contribution. However, this PR addresses a different problem than what was requested in issue #164.
- The feature needs to enable extraction of data from web-hosted media files like images, PDFs, or other document types.
- The feature must be integrated with the actual robot recording flow (while training the robot) and should not be a standalone script. It should be inside the recorder.
For example, if there is an image on a webpage with any written content, the user must be able to extract that content from the image.
Thanks for the feedback! I understand that this PR doesn’t address the requested feature in issue #164. I’ll work on updating it to extract data from web-hosted media files and integrate it directly into the robot recording flow. I’ll submit an updated PR soon.
Actionable comments posted: 1
♻️ Duplicate comments (1)
package.json (1)
31-85: Remove unused tesseract.js dependency. tesseract.js is still not imported anywhere—the browser scraper loads it from a CDN instead—so keeping it in dependencies only bloats install surface and supply-chain risk. Drop it until we actually bundle it.
📒 Files selected for processing (5)
maxun-core/src/browserSide/scraper.js (1 hunks)
package.json (3 hunks)
server/src/browser-management/inputHandlers.ts (2 hunks)
server/src/workflow-management/classes/Generator.ts (1 hunks)
src/components/recorder/DOMBrowserRenderer.tsx (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/browser-management/inputHandlers.ts (2)
server/src/workflow-management/classes/Generator.ts (1): handleMediaExtracted (156-180)
server/src/browser-management/classes/RemoteBrowser.ts (1): RemoteBrowser (128-2029)
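Based on the code graph and the sequence diagram above, the server-side wiring plausibly looks like the following sketch. The event names and payload fields come from the diagram; the class shape, registerMediaHandlers, and all other details are illustrative assumptions, not the PR's exact code.

// Sketch of the server-side flow: append the media event to the recording,
// then notify the client best-effort.
class WorkflowGenerator {
  constructor(socket) {
    this.socket = socket;
    this.workflowRecord = { events: [] };
  }

  handleMediaExtracted(data /* { url, tag, selector, extractedText } */) {
    // Append the media event to the recorded workflow with a timestamp.
    this.workflowRecord.events.push({ ...data, timestamp: Date.now() });
    try {
      this.socket.emit('workflow:media-added', data); // best-effort notify
    } catch { /* recording continues even if the emit fails */ }
  }
}

// Wiring in the input handler: relay the socket event into the generator.
function registerMediaHandlers(socket, generator) {
  socket.on('dom:media-extracted', (data) => generator.handleMediaExtracted(data));
}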
Media Parsing: Image Extraction
Closes Issue #164