feat: scrape [html + markdown] #889
Conversation
Walkthrough

Adds scrape-type robots: Playwright-based page fetch/cleanup, Turndown-based HTML→Markdown conversion, new scrape endpoints and UI, and propagation of requested output formats (markdown/html) through API, worker/scheduler, run serialization, sockets, and webhooks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant FE as Frontend
    participant API as Backend (/storage)
    participant DB as Database
    User->>FE: Create scrape robot (url, name, formats)
    FE->>API: POST /recordings/scrape { url, name, formats }
    API->>API: Validate url & formats
    API->>DB: Insert Robot (meta.type = scrape, url, formats)
    DB-->>API: Created Robot
    API-->>FE: 201 Created (robot)
```

```mermaid
sequenceDiagram
    participant Client
    participant Scheduler as scheduler/index.ts
    participant Worker as pgboss-worker.ts
    participant Scrape as markdownify/scrape.ts
    participant Markdown as markdownify/markdown.ts
    participant Socket as WebSocket
    participant Webhook as Webhook
    rect rgba(200,230,255,0.5)
        Client->>Scheduler: POST /robots/:id/runs (requestedFormats?)
        Scheduler->>Worker: start run (includes formats)
        Worker->>Socket: emit run-started
    end
    rect rgba(220,255,200,0.5)
        Worker->>Scrape: convertPageToMarkdown/convertPageToHTML(url)
        Scrape->>Scrape: Playwright fetch + sanitize DOM
        Scrape->>Markdown: parseMarkdown(cleanedHtml, baseUrl)
        Markdown-->>Scrape: markdown text
        Scrape-->>Worker: conversion results (markdown/html)
    end
    rect rgba(255,230,200,0.5)
        Worker->>DB: update run (serializableOutput, status: success/failed)
        Worker->>Socket: emit run-completed
        Worker->>Webhook: send run_completed payload (with markdown/html if present)
        Worker-->>Client: run result
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/components/robot/pages/RobotEditPage.tsx (1)
723-734: Fix potential runtime error in `getTargetUrl` when `what` is missing

`getTargetUrl` currently does:

```ts
const lastPair = robot?.recording.workflow[robot?.recording.workflow.length - 1];
url = lastPair?.what.find((action) => action.action === "goto")?.args?.[0];
```

If `lastPair` exists but `lastPair.what` is `undefined`/`null`, `lastPair?.what` evaluates to `undefined` and the subsequent `.find` will throw. A safer version would guard both `workflow` length and `what`:

```diff
- const lastPair =
-   robot?.recording.workflow[robot?.recording.workflow.length - 1];
- url = lastPair?.what.find((action) => action.action === "goto")
-   ?.args?.[0];
+ const workflow = robot?.recording?.workflow;
+ const lastPair =
+   workflow && workflow.length > 0
+     ? workflow[workflow.length - 1]
+     : undefined;
+ const gotoAction = lastPair?.what?.find(
+   (action) => action.action === "goto"
+ );
+ url = gotoAction?.args?.[0];
```

Same pattern appears in `RobotDuplicatePage.tsx`; consider applying the same fix (or factoring this into a shared helper) to avoid similar crashes in both places.

Also applies to: 817-823
🧹 Nitpick comments (10)
src/components/run/RunContent.tsx (1)
727-735: Extract inline download logic to improve consistency.

The HTML download button contains inline download logic, while the Markdown download uses the extracted `downloadMarkdown` function. For consistency and maintainability, extract the HTML download logic to a separate function.

Add this function near the `downloadMarkdown` function (after line 413):

```ts
const downloadHTML = (content: string, filename: string) => {
  const blob = new Blob([content], { type: 'text/html;charset=utf-8;' });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.setAttribute("download", filename);
  document.body.appendChild(link);
  link.click();
  document.body.removeChild(link);
  setTimeout(() => {
    URL.revokeObjectURL(url);
  }, 100);
};
```

Then update the button onClick:

```diff
 <Box sx={{ mt: 2 }}>
   <Button
-    onClick={() => {
-      const blob = new Blob([htmlContent], { type: 'text/html;charset=utf-8;' });
-      const url = URL.createObjectURL(blob);
-      const link = document.createElement("a");
-      link.href = url;
-      link.download = "output.html";
-      link.click();
-      setTimeout(() => URL.revokeObjectURL(url), 100);
-    }}
+    onClick={() => downloadHTML(htmlContent, 'output.html')}
     sx={{ color: '#FF00C3', textTransform: 'none' }}
   >
     Download
   </Button>
 </Box>
```

src/components/robot/pages/RobotIntegrationPage.tsx (1)
734-781: Add user feedback for disabled integrations.

While the buttons are correctly disabled for scrape robots with appropriate visual feedback, users receive no explanation for why these integrations are unavailable. Consider adding a tooltip or informational message.

Add a tooltip to inform users:

```diff
 <Button
   variant="outlined"
   disabled={isScrapeRobot}
+  title={isScrapeRobot ? "Google Sheets integration is not available for scrape-type robots" : ""}
   onClick={() => {
     if (isScrapeRobot) return;
```

Alternatively, add an informational alert above the integration buttons when `isScrapeRobot` is true:

```tsx
{isScrapeRobot && (
  <Alert severity="info" sx={{ mb: 2, maxWidth: "900px" }}>
    Google Sheets and Airtable integrations are not available for
    scrape-type robots. Use Webhooks or MCP for scrape robot integrations.
  </Alert>
)}
```

src/components/robot/pages/RobotDuplicatePage.tsx (1)
19-30: RobotMeta type narrowing and formats look consistent; consider centralizing the type

The additions of `type?: 'extract' | 'scrape'` and `formats?: ('markdown' | 'html')[]` are aligned with the new scrape/markdown flow and are type-safe here. Since the same `RobotMeta` shape now exists in multiple pages (e.g., `RobotEditPage.tsx`), consider moving this interface to a shared types module to avoid drift the next time metadata fields change.

server/src/pgboss-worker.ts (1)
192-303: Avoid double failure handling for scrape runs in `processRunExecution`

The scrape branch already updates the run to `failed`, emits `run-completed` (failed) events, and destroys the browser inside its `catch`. Re-throwing the error here causes the outer `catch (executionError)` to run as well, which:

- updates the run status/log again,
- emits another `run-completed` failure event,
- and sends a separate `run_failed` webhook.

This duplication makes failure behavior harder to reason about and can spam downstream consumers.

Consider either:

- not rethrowing from the scrape `catch` (return `{ success: false }` after local handling), or
- adding a flag to skip the generic failure path when the scrape branch has already fully handled the error.

That keeps scrape failures single-sourced and easier to maintain.
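The flag option can be sketched as follows. This is a minimal illustration, not the PR's code; every name here (`markHandled`, `runScrapeBranch`, `executeRun`) is hypothetical, and the real worker functions are async:

```typescript
// Tag an error as "already handled" so the outer catch can skip the
// generic failure path (webhooks, duplicate socket emits).
function markHandled(err: Error): Error {
  (err as any).alreadyHandled = true;
  return err;
}

function isHandled(err: unknown): boolean {
  return err instanceof Error && (err as any).alreadyHandled === true;
}

// Inner scrape branch: does its own cleanup/notifications, then rethrows
// a tagged error so the caller knows failure handling is already done.
function runScrapeBranch(convert: () => string): string {
  try {
    return convert();
  } catch (err) {
    // ...update run to failed, emit run-completed, destroy browser...
    throw markHandled(err instanceof Error ? err : new Error(String(err)));
  }
}

// Outer executor: runs the generic failure path only for untagged errors.
function executeRun(convert: () => string): { success: boolean; genericFailureRan: boolean } {
  try {
    runScrapeBranch(convert);
    return { success: true, genericFailureRan: false };
  } catch (err) {
    const skip = isHandled(err);
    if (!skip) {
      // generic path: update run, send run_failed webhook, emit failure event
    }
    return { success: false, genericFailureRan: !skip };
  }
}
```

With this shape, a scrape failure produces exactly one set of notifications regardless of which layer observes the error.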
server/src/workflow-management/scheduler/index.ts (1)
328-358: Simplify scrape failure handling to avoid duplicate failure events

On scrape failure you:

- update the run to `failed`,
- emit `run-completed` (failed) over both namespaces,
- destroy the browser,
- and then `throw error;`.

The outer `catch` at the bottom of `executeRun` will then:

- update the run to `failed` again,
- send `run_failed` webhooks,
- and emit another `run-completed` failure event.

This can cause double failure events for listeners (and makes it harder to reason about what fired when), even if the second emit happens after the namespace has been cleaned up.

A cleaner pattern would be to:

- keep all scrape-specific failure handling (including webhooks if needed) inside this branch and return `false` instead of rethrowing, or
- tag the error as "already handled" and branch accordingly in the outer catch.

That way each scheduled scrape run failure results in exactly one set of notifications.
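The "return instead of rethrow" option can be sketched as below. This is an illustrative stub only (the actual scheduler code is async and does far more); `executeScrapeRun` and its shape are assumptions, not the PR's API:

```typescript
// Handle the scrape failure fully in-branch and return false instead of
// rethrowing, so the outer catch in the scheduler never fires.
function executeScrapeRun(convert: () => string): boolean {
  try {
    convert();
    // ...update run to success, emit run-completed, send webhook...
    return true;
  } catch (err) {
    // All failure handling lives here exactly once:
    // update run to failed, emit the single run-completed (failed) event,
    // destroy the browser.
    return false; // nothing propagates to the outer catch
  }
}
```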
src/components/robot/RecordingsTable.tsx (1)
776-783: Consider defensive check for undefined robotType.

The conditional checks `robotType !== 'scrape'` without verifying that `robotType` is defined. If `row.type` is undefined (e.g., for older robots created before this field was added), the menu items will still render, which might not be the intended behavior.

Consider adding a defensive check:

```diff
- {robotType !== 'scrape' && (
+ {robotType && robotType !== 'scrape' && (
   <MenuItem onClick={() => { handleRetrain(); handleClose(); }}>
```

```diff
- {robotType !== 'scrape' && (
+ {robotType && robotType !== 'scrape' && (
   <MenuItem onClick={() => { handleDuplicate(); handleClose(); }}>
```

Or provide a default value in the row mapping (lines 252-273) to ensure `type` is always defined.
Also applies to: 795-800
src/components/robot/pages/RobotCreate.tsx (2)
402-402: Accessibility: FormControl label should use FormLabel component.

The label "Output Format (Select at least one)" is rendered as a plain `<p>` tag instead of using the Material-UI `FormLabel` component, which would provide better accessibility and consistent styling.

Apply this diff:

```diff
 <FormControl component="fieldset" sx={{ width: '100%', textAlign: 'left' }}>
-  <p>Output Format (Select at least one)</p>
+  <FormLabel component="legend">Output Format (Select at least one)</FormLabel>
   <FormControlLabel
```
381-399: Consider adding URL validation before form submission.

While the onClick handler validates the URL at lines 441-443, there's no real-time feedback when the user is typing. Consider adding URL validation on the TextField itself or showing helper text to guide the user.

Example pattern:

```tsx
const [urlError, setUrlError] = useState('');

const validateUrl = (value: string) => {
  try {
    new URL(value);
    setUrlError('');
  } catch {
    setUrlError('Please enter a valid URL');
  }
};
```

Then add to the TextField:

```tsx
error={!!urlError}
helperText={urlError}
onBlur={(e) => validateUrl(e.target.value)}
```

server/src/api/record.ts (1)
692-695: Consider error handling for conversion failures.

The markdown and HTML conversions are called independently, but if one fails while the other succeeds, the entire run fails and neither result is saved. Consider handling partial success scenarios where at least one format conversion succeeds.

Consider this approach:

```ts
const serializableOutput: any = {};
const errors: string[] = [];

// Markdown conversion
if (formats.includes('markdown')) {
  try {
    markdown = await convertPageToMarkdown(url);
    serializableOutput.markdown = [{ content: markdown }];
  } catch (err: any) {
    errors.push(`Markdown conversion failed: ${err.message}`);
  }
}

// HTML conversion
if (formats.includes('html')) {
  try {
    html = await convertPageToHTML(url);
    serializableOutput.html = [{ content: html }];
  } catch (err: any) {
    errors.push(`HTML conversion failed: ${err.message}`);
  }
}

if (Object.keys(serializableOutput).length === 0) {
  throw new Error(`All conversions failed: ${errors.join('; ')}`);
}
```

Also applies to: 698-701
server/src/markdownify/scrape.ts (1)
12-12: Very long timeout without justification or fallback.

Both functions use a 100-second (100000ms) timeout for page load, which is unusually long. This could cause runs to hang for extended periods if a page is slow or unresponsive. Consider using a shorter timeout with proper error handling.

Apply this diff to use a more reasonable timeout:

```diff
- await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+ await page.goto(url, { waitUntil: "networkidle", timeout: 30000 });
```

Or make it configurable:

```ts
const SCRAPE_TIMEOUT = process.env.SCRAPE_TIMEOUT_MS
  ? parseInt(process.env.SCRAPE_TIMEOUT_MS)
  : 30000;

await page.goto(url, { waitUntil: "networkidle", timeout: SCRAPE_TIMEOUT });
```

Also applies to: 67-67
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- package.json (3 hunks)
- server/src/api/record.ts (11 hunks)
- server/src/markdownify/markdown.ts (1 hunks)
- server/src/markdownify/scrape.ts (1 hunks)
- server/src/markdownify/test.ts (1 hunks)
- server/src/models/Robot.ts (1 hunks)
- server/src/pgboss-worker.ts (2 hunks)
- server/src/routes/storage.ts (4 hunks)
- server/src/workflow-management/scheduler/index.ts (3 hunks)
- src/api/storage.ts (1 hunks)
- src/components/robot/RecordingsTable.tsx (3 hunks)
- src/components/robot/pages/RobotCreate.tsx (2 hunks)
- src/components/robot/pages/RobotDuplicatePage.tsx (1 hunks)
- src/components/robot/pages/RobotEditPage.tsx (2 hunks)
- src/components/robot/pages/RobotIntegrationPage.tsx (2 hunks)
- src/components/robot/pages/RobotSettingsPage.tsx (1 hunks)
- src/components/run/RunContent.tsx (5 hunks)
- src/context/globalInfo.tsx (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
server/src/markdownify/markdown.ts (1)
- maxun-core/src/browserSide/scraper.js (1): depth (956-956)

server/src/markdownify/scrape.ts (1)
- server/src/markdownify/markdown.ts (1): parseMarkdown (1-98)

server/src/pgboss-worker.ts (4)
- maxun-core/src/interpret.ts (1): run (1620-1656)
- server/src/markdownify/scrape.ts (2): convertPageToMarkdown (8-57), convertPageToHTML (63-111)
- server/src/routes/webhook.ts (1): sendWebhook (404-434)
- server/src/browser-management/controller.ts (1): destroyRemoteBrowser (96-139)

server/src/routes/storage.ts (1)
- server/src/routes/record.ts (1): AuthenticatedRequest (25-27)

src/components/robot/pages/RobotCreate.tsx (2)
- src/context/globalInfo.tsx (1): useGlobalInfoStore (189-189)
- src/api/storage.ts (1): createScrapeRobot (31-59)

server/src/markdownify/test.ts (1)
- server/src/markdownify/scrape.ts (1): convertPageToMarkdown (8-57)

server/src/workflow-management/scheduler/index.ts (3)
- server/src/markdownify/scrape.ts (2): convertPageToMarkdown (8-57), convertPageToHTML (63-111)
- server/src/routes/webhook.ts (1): sendWebhook (404-434)
- server/src/browser-management/controller.ts (1): destroyRemoteBrowser (96-139)

server/src/api/record.ts (4)
- server/src/markdownify/scrape.ts (2): convertPageToMarkdown (8-57), convertPageToHTML (63-111)
- server/src/routes/webhook.ts (1): sendWebhook (404-434)
- server/src/browser-management/controller.ts (1): destroyRemoteBrowser (96-139)
- server/src/workflow-management/scheduler/index.ts (1): handleRunRecording (597-622)
🔇 Additional comments (7)
src/components/robot/pages/RobotSettingsPage.tsx (1)
19-21: LGTM! Type extensions are clean and backward-compatible.

The optional fields added to RobotMeta (type, url, formats) align with the broader scrape functionality introduced in this PR and maintain backward compatibility.
src/context/globalInfo.tsx (1)
30-32: LGTM! RobotMeta extensions are consistent.

The optional fields added to RobotMeta align with the same changes made across multiple files (RobotSettingsPage.tsx, Robot.ts model). The consistency ensures type safety throughout the application.
server/src/models/Robot.ts (1)
12-14: LGTM! Server-side type extensions align with frontend.

The optional fields added to the RobotMeta interface on the server side correctly mirror the frontend type definitions, ensuring type consistency across the full stack.
src/components/robot/pages/RobotIntegrationPage.tsx (1)
131-132: Good scrape robot detection.

The `isScrapeRobot` flag cleanly derives from the robot metadata and is used consistently to disable incompatible integrations.

package.json (1)
49-49: All dependencies are current and secure.

The three dependencies specified (turndown ^7.2.2, joplin-turndown-plugin-gfm ^1.0.12, and @types/turndown ^5.0.6) match their latest available versions and have no known security vulnerabilities according to GitHub's security advisory database.
src/components/robot/pages/RobotEditPage.tsx (1)
19-30: RobotMeta type/formats update is sound and consistent

The constrained `type?: 'extract' | 'scrape'` and added `formats?: ('markdown' | 'html')[]` align with the backend model and other UI code. No runtime impact here; just a safer, more explicit shape.

server/src/api/record.ts (1)
1047-1047: Review comment is incorrect: `waitForRunCompletion` still returns raw JSON, not a model instance.

The function uses `raw: true` in the Sequelize query at line 1043, which means it returns raw JSON data, not a Run model instance. The return value at line 1050 is this raw JSON object, which is correctly passed to `formatRunResponse`, which expects plain object properties. There is no breaking change here.

Likely an incorrect or invalid review comment.
```ts
if (recording.recording_meta.type === 'scrape') {
  logger.log('info', `Executing scrape robot for API run ${id}`);

  let formats = recording.recording_meta.formats || ['markdown'];

  // Override if API request defines formats
  if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
    formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
  }

  await run.update({
    status: 'running',
    log: `Converting page to: ${formats.join(', ')}`
  });

  try {
    const url = recording.recording_meta.url;

    if (!url) {
      throw new Error('No URL specified for markdown robot');
    }

    let markdown = '';
    let html = '';
    const serializableOutput: any = {};

    // Markdown conversion
    if (formats.includes('markdown')) {
      markdown = await convertPageToMarkdown(url);
      serializableOutput.markdown = [{ content: markdown }];
    }

    // HTML conversion
    if (formats.includes('html')) {
      html = await convertPageToHTML(url);
      serializableOutput.html = [{ content: html }];
    }

    await run.update({
      status: 'success',
      finishedAt: new Date().toLocaleString(),
      log: `${formats.join(', ')} conversion completed successfully`,
      serializableOutput,
      binaryOutput: {},
    });

    logger.log('info', `Markdown robot execution completed for API run ${id}`);

    // Push success socket event
    try {
      const completionData = {
        runId: plainRun.runId,
        robotMetaId: plainRun.robotMetaId,
        robotName: recording.recording_meta.name,
        status: 'success',
        finishedAt: new Date().toLocaleString()
      };

      serverIo
        .of('/queued-run')
        .to(`user-${userId}`)
        .emit('run-completed', completionData);
    } catch (socketError: any) {
      logger.log(
        'warn',
        `Failed to send run-completed notification for markdown robot run ${id}: ${socketError.message}`
      );
    }

    // Build webhook payload
    const webhookPayload: any = {
      robot_id: plainRun.robotMetaId,
      run_id: plainRun.runId,
      robot_name: recording.recording_meta.name,
      status: 'success',
      started_at: plainRun.startedAt,
      finished_at: new Date().toLocaleString(),
      metadata: {
        browser_id: plainRun.browserId,
        user_id: userId,
      },
    };

    if (formats.includes('markdown')) webhookPayload.markdown = markdown;
    if (formats.includes('html')) webhookPayload.html = html;

    try {
      await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
      logger.log(
        'info',
        `Webhooks sent successfully for markdown robot API run ${plainRun.runId}`
      );
    } catch (webhookError: any) {
      logger.log(
        'warn',
        `Failed to send webhooks for markdown robot run ${plainRun.runId}: ${webhookError.message}`
      );
    }

    await destroyRemoteBrowser(plainRun.browserId, userId);

    return {
      success: true,
      interpretationInfo: run.toJSON()
    };
  } catch (error: any) {
    logger.log(
      'error',
      `${formats.join(', ')} conversion failed for API run ${id}: ${error.message}`
    );

    await run.update({
      status: 'failed',
      finishedAt: new Date().toLocaleString(),
      log: `${formats.join(', ')} conversion failed: ${error.message}`,
    });

    // Send failure socket event
    try {
      const failureData = {
        runId: plainRun.runId,
        robotMetaId: plainRun.robotMetaId,
        robotName: recording.recording_meta.name,
        status: 'failed',
        finishedAt: new Date().toLocaleString()
      };

      serverIo
        .of('/queued-run')
        .to(`user-${userId}`)
        .emit('run-completed', failureData);
    } catch (socketError: any) {
      logger.log(
        'warn',
        `Failed to send run-failed notification for markdown robot run ${id}: ${socketError.message}`
      );
    }

    await destroyRemoteBrowser(plainRun.browserId, userId);

    throw error;
  }
}
```
Inconsistent naming: logs reference "markdown robot" but supports multiple formats.
Throughout the scrape execution block, logging messages refer to "markdown robot" even though the code correctly handles both markdown and html formats based on the formats array. This creates confusion about the actual functionality.
Update log messages to be format-agnostic:
```diff
- logger.log('info', `Executing scrape robot for API run ${id}`);
+ logger.log('info', `Executing scrape robot for API run ${id} with formats: ${formats.join(', ')}`);

  try {
    const url = recording.recording_meta.url;
    if (!url) {
-     throw new Error('No URL specified for markdown robot');
+     throw new Error('No URL specified for scrape robot');
    }

- logger.log('info', `Markdown robot execution completed for API run ${id}`);
+ logger.log('info', `Scrape robot execution completed for API run ${id}`);

  logger.log(
    'warn',
-   `Failed to send run-completed notification for markdown robot run ${id}: ${socketError.message}`
+   `Failed to send run-completed notification for scrape robot run ${id}: ${socketError.message}`
  );

  try {
    await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
    logger.log(
      'info',
-     `Webhooks sent successfully for markdown robot API run ${plainRun.runId}`
+     `Webhooks sent successfully for scrape robot API run ${plainRun.runId}`
    );
  } catch (webhookError: any) {
    logger.log(
      'warn',
-     `Failed to send webhooks for markdown robot run ${plainRun.runId}: ${webhookError.message}`
+     `Failed to send webhooks for scrape robot run ${plainRun.runId}: ${webhookError.message}`
    );
  }
```
```ts
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
  formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
}
```
Missing validation: requestedFormats should be validated before use.
When requestedFormats is provided, line 672 filters it to only include valid formats, but there's no check to ensure the resulting array isn't empty. If a caller passes formats: ['invalid'], the filtered array would be empty and no output would be generated, but the run would still succeed.
Apply this diff to add validation:

```diff
 // Override if API request defines formats
 if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
   formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
+  if (formats.length === 0) {
+    throw new Error('No valid formats specified. Supported formats: markdown, html');
+  }
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
  formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
  if (formats.length === 0) {
    throw new Error('No valid formats specified. Supported formats: markdown, html');
  }
}
```
🤖 Prompt for AI Agents
In server/src/api/record.ts around lines 670 to 673, the code filters
requestedFormats but does not validate the result; after filtering, check
whether the resulting formats array is non-empty and if it is empty return a 400
Bad Request (or throw a validation error) with a clear message like
"requestedFormats contains no supported formats; supported: markdown, html";
ensure you stop further processing when invalid so the run does not succeed with
no outputs.
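If the suggested check is hoisted into a helper, the default/override logic can be unit-tested on its own. The helper below is an illustrative sketch, not part of the PR — `normalizeFormats`, `ScrapeFormat`, and the fall-back-to-`['markdown']` behavior for missing input are all assumptions:

```typescript
type ScrapeFormat = "markdown" | "html";

const SUPPORTED_FORMATS: ScrapeFormat[] = ["markdown", "html"];

// Narrow an untrusted `requestedFormats` value to the supported formats.
// Missing/empty input falls back to the robot's default; an input that
// filters down to nothing is rejected instead of silently succeeding.
function normalizeFormats(requested: unknown): ScrapeFormat[] {
  if (!Array.isArray(requested) || requested.length === 0) {
    return ["markdown"]; // assumed default, mirroring the robot config
  }
  const formats = requested.filter(
    (f): f is ScrapeFormat => SUPPORTED_FORMATS.includes(f as ScrapeFormat)
  );
  if (formats.length === 0) {
    throw new Error("No valid formats specified. Supported formats: markdown, html");
  }
  return formats;
}
```

A route handler could then call `normalizeFormats(req.body.formats)` once and treat the thrown error as a 400.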
```ts
// Convert HTML → Markdown
try {
  let out = await t.turndown(tidiedHtml);
  out = fixBrokenLinks(out);
  out = stripSkipLinks(out);
  return out.trim();
} catch (err) {
  console.error("HTML→Markdown failed", { err });
  return "";
}
```
🧩 Analysis chain
Guard against tidyHtml returning null before passing to Turndown
`tidyHtml` currently does:

```ts
function tidyHtml(html: string): string {
  const $ = cheerio.load(html);
  // ...
  manuallyCleanedElements.forEach((tag) => $(tag).remove());
  return $("body").html();
}
```

But `$("body").html()` can be null, so `tidiedHtml` in `parseMarkdown` may be null even though the function is typed as returning `string`. That value is then passed into `t.turndown(...)`, which can throw and push you into the catch path unnecessarily.
Two concrete improvements:

```diff
-function tidyHtml(html: string): string {
+function tidyHtml(html: string): string {
   const cheerio = require("cheerio");
   const $ = cheerio.load(html);
   // ...
-  manuallyCleanedElements.forEach((tag) => $(tag).remove());
-  return $("body").html();
+  manuallyCleanedElements.forEach((tag) => $(tag).remove());
+
+  const bodyHtml = $("body").html();
+  return bodyHtml || html;
 }
```

and/or in `parseMarkdown`:

```diff
-  const tidiedHtml = tidyHtml(html);
+  const tidiedHtml = tidyHtml(html) || html;
```

This avoids spurious conversion failures and aligns the implementation with the declared return type.
Also applies to: 124-144
I'll verify the review comment by examining the actual code and confirming the Cheerio behavior.
🏁 Script executed:
```shell
#!/bin/bash
# Examine the markdown.ts file to see the parseMarkdown function and tidyHtml implementation
fd -e ts -n "markdown.ts" | head -5 | xargs -I {} cat "{}"
```

Length of output: 230
🏁 Script executed:
```shell
#!/bin/bash
# Search for tidyHtml function definition to confirm the implementation
rg -n "function tidyHtml" --type ts -A 15
```

Length of output: 920
🏁 Script executed:
```shell
#!/bin/bash
# Search for parseMarkdown to see how tidiedHtml is used
rg -n "const tidiedHtml" --type ts -B 5 -A 10
```

Length of output: 1063
🏁 Script executed:
```shell
#!/bin/bash
# Check if cheerio is in package.json and what version
fd -e json "package.json" | head -1 | xargs -I {} grep -A 50 '"dependencies"' {} | grep -i cheerio
```

Length of output: 0
🌐 Web query:
Cheerio .html() method return type null or string
💡 Result:
Cheerio's .html() returns string | null — it returns the HTML of the first matched element, or null if nothing is matched. [1][2]
Example (JS/TS):
```ts
const $ = cheerio.load('');
const v = $('div').html(); // v is string | null — here null, since no <div> matched
```
Sources: Cheerio API/docs. [1] [2]
References
[1] Cheerio — Classes: Cheerio.html() signature. [2] Cheerio — CheerioAPI / docs.
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the complete tidyHtml function (should be around lines 124-144)
sed -n '124,150p' server/src/markdownify/markdown.ts
```

Length of output: 501
🏁 Script executed:
```shell
#!/bin/bash
# Also check the full parseMarkdown function around line 88-97
sed -n '1,100p' server/src/markdownify/markdown.ts | tail -30
```

Length of output: 600
Guard against tidyHtml returning null before passing to Turndown
tidyHtml at line 124 declares a return type of string, but currently returns $("body").html() directly. Cheerio's .html() method returns string | null, so if the body element is missing or empty, the function returns null despite the type annotation.
This null value is then passed to t.turndown(tidiedHtml) at line 91, which expects a string. While the try/catch will handle the resulting error, this creates unnecessary failures and obscures actual issues.
Two concrete improvements:
At lines 124–144 in `tidyHtml`:

```ts
function tidyHtml(html: string): string {
  const cheerio = require("cheerio");
  const $ = cheerio.load(html);
  // ...
  manuallyCleanedElements.forEach((tag) => $(tag).remove());
  const bodyHtml = $("body").html();
  return bodyHtml || html;
}
```

Alternatively, at line 12 in `parseMarkdown`:

```ts
const tidiedHtml = tidyHtml(html) || html;
```

This aligns the implementation with the declared return type and prevents unnecessary conversion failures.
🤖 Prompt for AI Agents
In server/src/markdownify/markdown.ts around lines 88–97 (and tidyHtml
implementation around lines 124–144), tidyHtml can return null because cheerio's
$("body").html() is string | null; update tidyHtml to coerce a non-null string
(e.g., const bodyHtml = $("body").html(); return bodyHtml || html;) or,
alternatively, change the caller at line ~88 to use const tidiedHtml =
tidyHtml(html) || html before passing to t.turndown; ensure the value passed to
turndown is always a string to prevent runtime errors.
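The null guard itself is easy to test in isolation. In this sketch, `extractBody` is a hypothetical stand-in for cheerio's `$("body").html()` (typed `string | null`), so the fallback can run without loading cheerio or a DOM:

```typescript
// Stand-in for cheerio's `$("body").html()`, which is typed string | null.
type BodyExtractor = (html: string) => string | null;

// Mirrors the suggested fix: when the extractor yields null (no <body>
// matched) — or an empty string, since `||` also coerces that — fall back
// to the raw input so callers always receive a string.
function tidyWithFallback(html: string, extractBody: BodyExtractor): string {
  const bodyHtml = extractBody(html);
  return bodyHtml || html;
}
```

Whether an empty body should also fall back to the raw input (as `||` does, versus `??`) is a design choice worth making explicit in the real `tidyHtml`.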
```ts
export async function convertPageToMarkdown(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

  await page.addInitScript(() => {
    const selectors = [
      "script",
      "style",
      "link[rel='stylesheet']",
      "noscript",
      "meta",
      "svg",
      "img",
      "picture",
      "source",
      "video",
      "audio",
      "iframe",
      "object",
      "embed"
    ];

    selectors.forEach(sel => {
      document.querySelectorAll(sel).forEach(e => e.remove());
    });

    // Remove inline event handlers (onclick, onload…)
    const all = document.querySelectorAll("*");
    all.forEach(el => {
      [...el.attributes].forEach(attr => {
        if (attr.name.startsWith("on")) {
          el.removeAttribute(attr.name);
        }
      });
    });
  });

  // Re-extract HTML after cleanup
  const cleanedHtml = await page.evaluate(() => {
    return document.documentElement.outerHTML;
  });

  await browser.close();

  // Convert cleaned HTML → Markdown
  const markdown = await parseMarkdown(cleanedHtml, url);
  return markdown;
}
```
Missing resource cleanup: browser should be closed in a finally block.
If an error occurs during page navigation or evaluation, the browser instance won't be closed, leading to resource leaks. Both functions should wrap browser operations in try-finally blocks.
Apply this diff to both functions:
```diff
 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await chromium.launch();
-  const page = await browser.newPage();
+  try {
+    const page = await browser.newPage();

-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+    await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

-  await page.addInitScript(() => {
-    // ... cleanup script
-  });
+    // ... rest of the logic

-  const cleanedHtml = await page.evaluate(() => {
-    return document.documentElement.outerHTML;
-  });
+    const cleanedHtml = await page.evaluate(() => {
+      return document.documentElement.outerHTML;
+    });

-  await browser.close();
+    const markdown = await parseMarkdown(cleanedHtml, url);
+    return markdown;
+  } finally {
+    await browser.close();
+  }
 }
```

Also applies to: 63-111
🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 8-57 (and similarly at 63-111),
the browser is launched and operations are performed but browser.close() is only
called on the happy path; wrap the Playwright flow in a try/finally so the
browser is closed regardless of errors. Specifically, after launching the
browser (and creating the page), move the navigation/evaluation logic into a try
block and put await browser.close() in a finally block (guarding for a possibly
undefined browser), optionally closing the page if created; rethrow the error
after cleanup so behavior is preserved.
```ts
await page.addInitScript(() => {
  const selectors = [
    "script",
    "style",
    "link[rel='stylesheet']",
    "noscript",
    "meta",
    "svg",
    "img",
    "picture",
    "source",
    "video",
    "audio",
    "iframe",
    "object",
    "embed"
  ];

  selectors.forEach(sel => {
    document.querySelectorAll(sel).forEach(e => e.remove());
  });

  // Remove inline event handlers (onclick, onload…)
  const all = document.querySelectorAll("*");
  all.forEach(el => {
    [...el.attributes].forEach(attr => {
      if (attr.name.startsWith("on")) {
        el.removeAttribute(attr.name);
      }
    });
  });
});
```
🛠️ Refactor suggestion | 🟠 Major
Significant code duplication: cleanup logic is identical in both functions.
The element removal and event handler cleanup code (lines 14-45 and 69-100) is duplicated verbatim between convertPageToMarkdown and convertPageToHTML. This violates the DRY principle and makes maintenance harder.
Extract the cleanup logic into a shared helper function:

```ts
async function cleanPage(page: Page): Promise<string> {
  await page.addInitScript(() => {
    const selectors = [
      "script",
      "style",
      "link[rel='stylesheet']",
      "noscript",
      "meta",
      "svg",
      "img",
      "picture",
      "source",
      "video",
      "audio",
      "iframe",
      "object",
      "embed"
    ];
    selectors.forEach(sel => {
      document.querySelectorAll(sel).forEach(e => e.remove());
    });
    // Remove inline event handlers
    const all = document.querySelectorAll("*");
    all.forEach(el => {
      [...el.attributes].forEach(attr => {
        if (attr.name.startsWith("on")) {
          el.removeAttribute(attr.name);
        }
      });
    });
  });

  return await page.evaluate(() => {
    return document.documentElement.outerHTML;
  });
}

export async function convertPageToMarkdown(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
  const cleanedHtml = await cleanPage(page);
  await browser.close();
  const markdown = await parseMarkdown(cleanedHtml, url);
  return markdown;
}

export async function convertPageToHTML(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
  const cleanedHtml = await cleanPage(page);
  await browser.close();
  return cleanedHtml;
}
```

Also applies to: 69-100
🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 14-45 and 69-100, the DOM
cleanup logic is duplicated; extract it into a single async helper (e.g.,
cleanPage(page: Page): Promise<string>) that calls page.addInitScript with the
selector removal and inline event-attribute cleanup, then returns await
page.evaluate(() => document.documentElement.outerHTML); replace the duplicated
blocks in convertPageToMarkdown and convertPageToHTML with a call to this
helper, keep launching/closing the browser and page handling the same, and add
any necessary imports/types for Playwright's Page.
Critical issue: addInitScript runs AFTER initial page load, so cleanup doesn't affect the fetched content.
The addInitScript is called after page.goto, which means it only affects subsequent navigations, not the current page that was just loaded. The cleanup script never runs on the initial page load, so all scripts, styles, images, etc. are still present in the extracted HTML.
Move the cleanup to execute on the loaded page using `evaluate`, or call `addInitScript` before `goto`:

```diff
 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await chromium.launch();
   const page = await browser.newPage();
+  await page.addInitScript(() => {
+    // Cleanup script here...
+  });
+
   await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

-  await page.addInitScript(() => {
-    // Cleanup script here...
-  });
-
-  // Re-extract HTML after cleanup
-  const cleanedHtml = await page.evaluate(() => {
+  const cleanedHtml = await page.evaluate(() => {
+    // Run cleanup directly in evaluate
+    const selectors = ["script", "style", ...];
+    selectors.forEach(sel => {
+      document.querySelectorAll(sel).forEach(e => e.remove());
+    });
+
+    const all = document.querySelectorAll("*");
+    all.forEach(el => {
+      [...el.attributes].forEach(attr => {
+        if (attr.name.startsWith("on")) {
+          el.removeAttribute(attr.name);
+        }
+      });
+    });
+
     return document.documentElement.outerHTML;
   });
```

Also applies to: 69-100
🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 14 to 45 (and similarly for
lines ~69-100), the cleanup script is being registered with page.addInitScript
after page.goto so it doesn't run on the initial navigation; move the cleanup to
run on the loaded page by either registering the addInitScript before calling
page.goto OR running the same DOM-removal logic immediately after navigation
using page.evaluate; update both places to ensure the selectors and inline-event
removal execute on the current document (call addInitScript before any
navigation or replace the post-load addInitScript with page.evaluate after await
page.goto).
```ts
if (recording.recording_meta.type === 'scrape') {
  logger.log('info', `Executing scrape robot for scheduled run ${id}`);

  const formats = recording.recording_meta.formats || ['markdown'];

  await run.update({
    status: 'running',
    log: `Converting page to: ${formats.join(', ')}`
  });

  try {
    const runStartedData = {
      runId: plainRun.runId,
      robotMetaId: plainRun.robotMetaId,
      robotName: recording.recording_meta.name,
      status: 'running',
      startedAt: plainRun.startedAt
    };

    serverIo.of('/queued-run').to(`user-${userId}`).emit('run-started', runStartedData);
    logger.log(
      'info',
      `Markdown robot run started notification sent for run: ${plainRun.runId} to user-${userId}`
    );
  } catch (socketError: any) {
    logger.log(
      'warn',
      `Failed to send run-started notification for markdown robot run ${plainRun.runId}: ${socketError.message}`
    );
  }

  try {
    const url = recording.recording_meta.url;

    if (!url) {
      throw new Error('No URL specified for markdown robot');
    }

    let markdown = '';
    let html = '';
    const serializableOutput: any = {};

    // Markdown conversion
    if (formats.includes('markdown')) {
      markdown = await convertPageToMarkdown(url);
      serializableOutput.markdown = [{ content: markdown }];
    }

    // HTML conversion
    if (formats.includes('html')) {
      html = await convertPageToHTML(url);
      serializableOutput.html = [{ content: html }];
    }

    await run.update({
      status: 'success',
      finishedAt: new Date().toLocaleString(),
      log: `${formats.join(', ')} conversion completed successfully`,
      serializableOutput,
      binaryOutput: {},
    });

    logger.log('info', `Markdown robot execution completed for scheduled run ${id}`);

    // Run-completed socket notifications
    try {
      const completionData = {
        runId: plainRun.runId,
        robotMetaId: plainRun.robotMetaId,
        robotName: recording.recording_meta.name,
        status: 'success',
        finishedAt: new Date().toLocaleString()
      };

      serverIo.of(plainRun.browserId).emit('run-completed', completionData);
      serverIo.of('/queued-run').to(`user-${userId}`).emit('run-completed', completionData);
    } catch (socketError: any) {
      logger.log(
        'warn',
        `Failed to send run-completed notification for markdown robot run ${id}: ${socketError.message}`
      );
    }

    // Webhook payload
    const webhookPayload: any = {
      robot_id: plainRun.robotMetaId,
      run_id: plainRun.runId,
      robot_name: recording.recording_meta.name,
      status: 'success',
      started_at: plainRun.startedAt,
      finished_at: new Date().toLocaleString(),
      metadata: {
        browser_id: plainRun.browserId,
        user_id: userId,
      }
    };

    if (formats.includes('markdown')) webhookPayload.markdown = markdown;
    if (formats.includes('html')) webhookPayload.html = html;

    try {
      await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
      logger.log(
        'info',
        `Webhooks sent successfully for markdown robot scheduled run ${plainRun.runId}`
      );
    } catch (webhookError: any) {
      logger.log(
        'warn',
        `Failed to send webhooks for markdown robot run ${plainRun.runId}: ${webhookError.message}`
      );
    }

    await destroyRemoteBrowser(plainRun.browserId, userId);

    return true;

  } catch (error: any) {
    logger.log('error', `${formats.join(', ')} conversion failed for scheduled run ${id}: ${error.message}`);

    await run.update({
      status: 'failed',
      finishedAt: new Date().toLocaleString(),
      log: `${formats.join(', ')} conversion failed: ${error.message}`,
    });

    try {
      const failureData = {
        runId: plainRun.runId,
        robotMetaId: plainRun.robotMetaId,
        robotName: recording.recording_meta.name,
        status: 'failed',
        finishedAt: new Date().toLocaleString()
      };

      serverIo.of(plainRun.browserId).emit('run-completed', failureData);
      serverIo.of('/queued-run').to(`user-${userId}`).emit('run-completed', failureData);
    } catch (socketError: any) {
      logger.log(
        'warn',
        `Failed to send run-failed notification for markdown robot run ${id}: ${socketError.message}`
      );
    }

    await destroyRemoteBrowser(plainRun.browserId, userId);

    throw error;
  }
}
```
🧩 Analysis chain
Scrape path bypasses proxy/remote browser config and duplicates logic
The new scrape branch works functionally, but a few aspects are worth tightening:
- **Proxy and browser config are ignored**

  Earlier in this file you resolve proxy settings and create a remote browser for the run, but `convertPageToMarkdown`/`convertPageToHTML` each launch their own Playwright browser with default options. That means:

  - user-specific proxy settings are not applied for scrape runs;
  - stealth/plugins or other browser configuration used for normal runs are bypassed.

  For users who rely on proxies to reach target sites, scrape runs will likely fail even though extract-type robots would succeed.

  Consider either:

  - passing proxy options into `convertPageToMarkdown`/`convertPageToHTML`, or
  - refactoring them to operate on the already-initialized remote browser/page rather than spinning up a separate browser.

- **Logic is duplicated with `server/src/pgboss-worker.ts`**

  The scrape success path (formats handling, serializableOutput shape, socket notifications, webhook payload structure) is almost identical to the one added in `pgboss-worker.ts`. Extracting a shared helper (e.g., `runScrapeConversion(recording, formats, run, userId, browserId, context)`) would keep behavior consistent and make future changes (like the proxy handling above) less error-prone.
I'll verify the key claims in the review comment by examining the implementation and comparing with pgboss-worker.ts.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find implementations of convertPageToMarkdown and convertPageToHTML
rg -n "convertPageToMarkdown|convertPageToHTML" -A 10 -B 2 --type=ts
```

Length of output: 9526
🏁 Script executed:
```shell
#!/bin/bash
# Check how proxy options are set up earlier in scheduler/index.ts
rg -n "proxy" server/src/workflow-management/scheduler/index.ts -B 3 -A 3 --type=ts
```

Length of output: 850
🏁 Script executed:
```shell
#!/bin/bash
# Extract the scrape success path logic from scheduler/index.ts and compare with pgboss-worker
sed -n '211,359p' server/src/workflow-management/scheduler/index.ts > /tmp/scheduler_scrape.ts
cat /tmp/scheduler_scrape.ts | head -150
```

Length of output: 4956
🏁 Script executed:
```shell
#!/bin/bash
# Look for similar scrape handling in pgboss-worker.ts
rg -n "convertPageToMarkdown|convertPageToHTML|scrape" server/src/pgboss-worker.ts -B 2 -A 15 --type=ts | head -100
```

Length of output: 3827
I need to verify one more claim about the failure handling doubling. Let me check the outer catch block.
---
🏁 Script executed:
```shell
#!/bin/bash
# Look for executeRun function and its outer catch block to verify double failure handling
rg -n "executeRun|catch.*error" server/src/workflow-management/scheduler/index.ts -B 2 -A 5 | tail -80
```

Length of output: 2075
🏁 Script executed:
```shell
#!/bin/bash
# Check the full function structure around executeRun to understand error propagation
sed -n '130,220p' server/src/workflow-management/scheduler/index.ts
```

Length of output: 2960
Scrape path bypasses proxy/remote browser config and duplicates logic
The new scrape branch works functionally, but has three real concerns worth addressing:
- **Proxy and browser config are ignored**

  The proxy config is retrieved and configured earlier in this file (lines 37-47), but `convertPageToMarkdown` and `convertPageToHTML` launch their own Playwright browser instances with default options. This means:

  - user-specific proxy settings are not applied for scrape runs
  - stealth plugins or other browser configuration used for normal runs are bypassed

  Users who rely on proxies to reach target sites will see scrape runs fail where extract-type robots would succeed.

  Consider passing `proxyOptions` into these functions or refactoring them to operate on the already-initialized remote browser instead.

- **Logic is duplicated across multiple files**

  The scrape success path (formats handling, serializableOutput shape, socket notifications, webhook payload structure) is nearly identical in:

  - `server/src/workflow-management/scheduler/index.ts` (lines 254–270)
  - `server/src/pgboss-worker.ts` (lines 214–232)
  - `server/src/api/record.ts` (lines 691–709)

  Extracting a shared helper would keep behavior consistent and reduce maintenance risk.

- **Double failure status updates on error**

  When conversion fails, the inner catch (line 328) updates `run.status` to `'failed'` and throws, which is then caught by the outer handler (line 508), updating the status again. This causes duplicate notifications and database writes.
🤖 Prompt for AI Agents
In server/src/workflow-management/scheduler/index.ts around lines 211–359 the
new scrape branch bypasses user proxy/remote-browser config, duplicates
result/notification/webhook logic found in other files, and performs a redundant
run.status update on conversion error; fix by 1) changing the markdown/html
conversion calls to accept and use the existing proxy/browser options (or accept
a browser instance) so they reuse the remote browser/proxy/stealth config
initialized earlier (pass proxyOptions or browser handle into
convertPageToMarkdown/convertPageToHTML), 2) extract the shared success/failure
handling (building serializableOutput, socket notifications, webhook payload and
run.update calls) into a single helper used by scheduler, pgBoss worker and
api/record to avoid duplication, and 3) remove the inner catch’s run.update that
sets status to 'failed' (or convert it to only prepare failure data without
persisting) so that the outer error handler is the single place that updates run
status and emits failure notifications to prevent double updates.
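As a sketch of the shared-helper idea, the output-shaping half — the part that is currently identical across the scheduler, the pg-boss worker, and the API route — could be factored out like this. `buildScrapeOutput`, `ScrapeResults`, and the return shape are hypothetical names, not the PR's API:

```typescript
type ScrapeFormat = "markdown" | "html";

interface ScrapeResults {
  markdown?: string;
  html?: string;
}

// Build the serializableOutput entries and flat webhook fields in one
// place, so all three call sites serialize scrape results identically.
function buildScrapeOutput(formats: ScrapeFormat[], results: ScrapeResults) {
  const serializableOutput: Record<string, { content: string }[]> = {};
  const webhookFields: Record<string, string> = {};

  if (formats.includes("markdown") && results.markdown !== undefined) {
    serializableOutput.markdown = [{ content: results.markdown }];
    webhookFields.markdown = results.markdown;
  }
  if (formats.includes("html") && results.html !== undefined) {
    serializableOutput.html = [{ content: results.html }];
    webhookFields.html = results.html;
  }

  return { serializableOutput, webhookFields };
}
```

Each caller would then spread `webhookFields` into its webhook payload and pass `serializableOutput` to `run.update`, leaving only the run-lifecycle bookkeeping site-specific.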
```ts
export const createScrapeRobot = async (
  url: string,
  name?: string,
  formats: string[] = ['markdown']
): Promise<any> => {
  try {
    const response = await axios.post(
      `${apiUrl}/storage/recordings/scrape`,
      {
        url,
        name,
        formats,
      },
      {
        headers: { 'Content-Type': 'application/json' },
        withCredentials: true,
      }
    );

    if (response.status === 201) {
      return response.data;
    } else {
      throw new Error('Failed to create markdown robot');
    }
  } catch (error: any) {
    console.error('Error creating markdown robot:', error);
    return null;
  }
};
```
Fix error message and consider error handling consistency.
The error message on line 53 references "markdown robot" but should reference "scrape robot" for consistency with the function name and broader terminology.
Apply this diff:
```diff
     } else {
-      throw new Error('Failed to create markdown robot');
+      throw new Error('Failed to create scrape robot');
     }
   } catch (error: any) {
-    console.error('Error creating markdown robot:', error);
+    console.error('Error creating scrape robot:', error);
     return null;
   }
```

Additionally, consider whether returning null on error is the best approach. Other functions in this file (e.g., `getStoredRecordings`, `getStoredRecording`) also return null on error, which is consistent, but this pattern forces callers to check for null. Consider whether throwing errors would provide better error propagation.
🤖 Prompt for AI Agents
In src/api/storage.ts around lines 31 to 59, the error messages refer to
"markdown robot" but the function creates a "scrape robot" — update the thrown
Error and the console.error message to reference "scrape robot" instead;
additionally decide on consistent error handling: either keep returning null on
error to match getStoredRecordings/getStoredRecording (no behavioral change), or
switch to rethrowing the caught error (or throw a new Error with context) so
callers can handle exceptions — implement one approach consistently and update
the function signature/JSdoc/comments accordingly.
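If the file did move from return-null to throwing, a thin wrapper could attach context while preserving the original error for callers that want it. `ScrapeRobotError` and `createWithContext` below are hypothetical names sketching that design choice, not code from the PR:

```typescript
// Error type that carries the underlying failure alongside a stable message.
class ScrapeRobotError extends Error {
  constructor(message: string, public readonly inner?: unknown) {
    super(message);
    this.name = "ScrapeRobotError";
  }
}

// Rethrow with context instead of swallowing to null, so a call site can
// still opt into the old behavior with a local try/catch when it wants to.
function createWithContext<T>(op: () => T): T {
  try {
    return op();
  } catch (err) {
    throw new ScrapeRobotError("Failed to create scrape robot", err);
  }
}
```

The trade-off is exactly the one the bot raises: returning null keeps this file internally consistent, while throwing gives callers the failure reason instead of forcing a null check.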
```tsx
  notify('success', `${scrapeRobotName} created successfully!`);
  navigate('/robots');
} else {
  notify('error', 'Failed to create markdown robot');
```
Inconsistent error message: references "markdown robot" instead of "scrape robot".
The error message at line 463 says "Failed to create markdown robot" but the Scrape tab supports both markdown and HTML formats. The message should be more generic.
Apply this diff:
```diff
-  notify('error', 'Failed to create markdown robot');
+  notify('error', 'Failed to create scrape robot');
```

📝 Committable suggestion
```tsx
notify('error', 'Failed to create scrape robot');
```
🤖 Prompt for AI Agents
In src/components/robot/pages/RobotCreate.tsx around line 463, the notify error
message incorrectly says "Failed to create markdown robot"; update this to a
generic message that covers both markdown and HTML (e.g., "Failed to create
scrape robot" or "Failed to create robot") so the Scrape tab error is accurate;
replace the string in the notify('error', ...) call and run/adjust any tests or
snapshots that assert the old message.
```tsx
<ListItemIcon>
  <Refresh fontSize="small" />
</ListItemIcon>
<ListItemText>Retrain</ListItemText>
```
🛠️ Refactor suggestion | 🟠 Major
Use translation keys instead of hardcoded English strings.
The menu item labels were changed from translated strings (using t(...)) to hardcoded English text. This breaks internationalization support that exists throughout the rest of the component.
Apply this diff to restore i18n support:
```diff
-          <ListItemText>Retrain</ListItemText>
+          <ListItemText>{t('recordingtable.options.retrain')}</ListItemText>
         <MenuItem onClick={() => { handleEdit(); handleClose(); }}>
           <ListItemIcon><Edit fontSize="small" /></ListItemIcon>
-          <ListItemText>Edit</ListItemText>
+          <ListItemText>{t('recordingtable.options.edit')}</ListItemText>
         </MenuItem>
         <MenuItem onClick={() => { handleDelete(); handleClose(); }}>
           <ListItemIcon><DeleteForever fontSize="small" /></ListItemIcon>
-          <ListItemText>Delete</ListItemText>
+          <ListItemText>{t('recordingtable.options.delete')}</ListItemText>
         </MenuItem>
         {robotType !== 'scrape' && (
           <MenuItem onClick={() => { handleDuplicate(); handleClose(); }}>
             <ListItemIcon><ContentCopy fontSize="small" /></ListItemIcon>
-            <ListItemText>Duplicate</ListItemText>
+            <ListItemText>{t('recordingtable.options.duplicate')}</ListItemText>
           </MenuItem>
         )}
```

Also applies to: 787-787, 792-792, 798-798
🤖 Prompt for AI Agents
In src/components/robot/RecordingsTable.tsx around lines 781 (and similarly at
lines 787, 792, 798), several MenuItem labels were changed to hardcoded English
strings like "Retrain", breaking i18n; replace those hardcoded strings with the
translation function calls (e.g. t('your.translation.key') for each label),
using the existing translation keys used elsewhere in the component (or add
appropriate new keys) so each MenuItemText uses t(...) instead of a literal
string, and ensure the component imports/has access to the t function from
react-i18next as used elsewhere in this file.
```tsx
useEffect(() => {
  setMarkdownContent('');
  setHtmlContent('');

  if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown)) {
    const markdownData = row.serializableOutput.markdown[0];
    if (markdownData?.content) {
      setMarkdownContent(markdownData.content);
    }
  }

  if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html)) {
    const htmlData = row.serializableOutput.html[0];
    if (htmlData?.content) {
      setHtmlContent(htmlData.content);
    }
  }
}, [row.serializableOutput]);
```
🛠️ Refactor suggestion | 🟠 Major
Add safety checks for array access.
The code assumes the markdown and html arrays have at least one element without validation. If the arrays are empty, accessing [0] will return undefined, which could cause issues.
Apply this diff to add safety checks:
```diff
 useEffect(() => {
   setMarkdownContent('');
   setHtmlContent('');
-  if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown)) {
+  if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown) && row.serializableOutput.markdown.length > 0) {
     const markdownData = row.serializableOutput.markdown[0];
     if (markdownData?.content) {
       setMarkdownContent(markdownData.content);
     }
   }
-  if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html)) {
+  if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html) && row.serializableOutput.html.length > 0) {
     const htmlData = row.serializableOutput.html[0];
     if (htmlData?.content) {
       setHtmlContent(htmlData.content);
     }
   }
 }, [row.serializableOutput]);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```tsx
useEffect(() => {
  setMarkdownContent('');
  setHtmlContent('');
  if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown) && row.serializableOutput.markdown.length > 0) {
    const markdownData = row.serializableOutput.markdown[0];
    if (markdownData?.content) {
      setMarkdownContent(markdownData.content);
    }
  }
  if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html) && row.serializableOutput.html.length > 0) {
    const htmlData = row.serializableOutput.html[0];
    if (htmlData?.content) {
      setHtmlContent(htmlData.content);
    }
  }
}, [row.serializableOutput]);
```
🤖 Prompt for AI Agents
In src/components/run/RunContent.tsx around lines 68 to 86 the effect assumes
row.serializableOutput.markdown and .html arrays contain at least one element
and accesses [0] directly; update both branches to first check that the array
exists and has length > 0 (or use .find to locate the first element with a
content property) before reading [0] or .content, and only call
setMarkdownContent/setHtmlContent when a valid item with .content is found to
avoid undefined access when arrays are empty.
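The `.find`-based approach mentioned in the prompt can be sketched in isolation. This is a minimal standalone sketch; the `OutputItem` shape and `firstContent` helper name are assumed from the review context, not taken from the actual `RunContent.tsx` types.

```typescript
// Hypothetical item shape, inferred from the review context.
interface OutputItem {
  content?: string;
}

// Return the content of the first item that actually carries a
// non-empty string, or null when the array is missing, empty,
// or has no item with content.
function firstContent(items?: OutputItem[]): string | null {
  if (!Array.isArray(items) || items.length === 0) {
    return null;
  }
  const found = items.find(
    (item) => typeof item.content === 'string' && item.content.length > 0
  );
  return found?.content ?? null;
}

const serializableOutput = {
  markdown: [{ content: '# Scraped page' }] as OutputItem[],
  html: [] as OutputItem[],
};

console.log(firstContent(serializableOutput.markdown)); // "# Scraped page"
console.log(firstContent(serializableOutput.html));     // null
console.log(firstContent(undefined));                   // null
```

With this helper, both effect branches collapse to a single guarded call, and empty arrays fall through to the reset values set at the top of the effect.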
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/components/robot/pages/RobotCreate.tsx (1)
464-464: Inconsistent error message: references "markdown robot" instead of "scrape robot".

The error message says "Failed to create markdown robot", but the Scrape tab supports both markdown and HTML formats. The message should be generic.

Apply this diff:

```diff
- notify('error', 'Failed to create markdown robot');
+ notify('error', 'Failed to create scrape robot');
```
🧹 Nitpick comments (2)
src/components/robot/pages/RobotCreate.tsx (2)
403-403: Consider using `FormLabel` for consistency.

The label uses a plain `<p>` tag instead of Material-UI's `FormLabel` component, which would be more consistent with the `FormControl` usage.

Apply this diff:

```diff
-<FormControl component="fieldset" sx={{ width: '100%', textAlign: 'left' }}>
-  <p>Output Format (Select at least one)</p>
+<FormControl component="fieldset" sx={{ width: '100%', textAlign: 'left' }}>
+  <FormLabel component="legend">Output Format (Select at least one)</FormLabel>
```
264-271: Consider extracting duplicated button styles.

The button styles are duplicated between the Extract and Scrape tabs. Consider extracting them to a shared constant for maintainability.

Example refactor:

```tsx
const actionButtonStyles = {
  bgcolor: '#ff00c3',
  py: 1.4,
  fontSize: '1rem',
  textTransform: 'none',
  maxWidth: 700,
  borderRadius: 2
};

// Then use:
// sx={actionButtonStyles}
```

Also applies to: 468-475
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/components/robot/pages/RobotCreate.tsx (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/components/robot/pages/RobotCreate.tsx (2)
src/context/globalInfo.tsx (1)
useGlobalInfoStore (189-189)

src/api/storage.ts (1)

createScrapeRobot (31-59)
🔇 Additional comments (1)
src/components/robot/pages/RobotCreate.tsx (1)
442-445: Backend validates URL format properly; no action needed.

The frontend validation (lines 442-445) only checks for non-empty input, but the backend endpoint at `server/src/routes/storage.ts:456-459` validates URL format using the `new URL()` constructor, which properly parses and rejects invalid URLs. This backend validation runs before any processing occurs, adequately addressing malformed or invalid URL inputs.
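The `new URL()` check described above can be sketched in isolation: the WHATWG URL constructor throws on malformed input, so wrapping it in try/catch yields a simple validity test. The `isValidScrapeUrl` name and the http/https protocol restriction are assumptions for illustration, not necessarily what `storage.ts` enforces.

```typescript
// Validate a candidate scrape URL. new URL() throws a TypeError on
// input that cannot be parsed as an absolute URL, so catching that
// error distinguishes valid from invalid input.
function isValidScrapeUrl(input: string): boolean {
  try {
    const parsed = new URL(input);
    // Assumed restriction: only allow web schemes for scraping.
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    return false;
  }
}

console.log(isValidScrapeUrl('https://example.com/page')); // true
console.log(isValidScrapeUrl('not-a-url'));                // false
console.log(isValidScrapeUrl('ftp://example.com'));        // false
```

Because the parse happens before any robot record is created, malformed input is rejected with a 4xx response rather than producing a run that fails later in the worker.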
Summary by CodeRabbit
New Features
Updates
✏️ Tip: You can customize this high-level summary in your review settings.