@amhsirak amhsirak commented Nov 17, 2025

  • markdown
  • html
  • api
  • UI
  • telemetry
  • docs

Summary by CodeRabbit

  • New Features

    • Create "scrape" robots from a URL with selectable output formats (Markdown/HTML).
    • Tabbed Robot creation UI to choose Extract or Scrape modes.
    • View and download Markdown and HTML outputs from run results.
  • Updates

    • Robot metadata now tracks type, URL, and output formats.
    • Google Sheets and Airtable integrations hidden for scrape-type robots.
    • Recording URL can be updated and used for scrape workflows.


@amhsirak amhsirak marked this pull request as draft November 17, 2025 14:22

coderabbitai bot commented Nov 17, 2025

Walkthrough

Adds scrape-type robots: Playwright-based page fetch/cleanup, Turndown-based HTML→Markdown conversion, new scrape endpoints and UI, and propagation of requested output formats (markdown/html) through API, worker/scheduler, run serialization, sockets, and webhooks.

Changes

Cohort / File(s): Summary

  • Dependencies (package.json): Added turndown and joplin-turndown-plugin-gfm to dependencies and @types/turndown to devDependencies.
  • Markdown utilities (server/src/markdownify/markdown.ts, server/src/markdownify/scrape.ts, server/src/markdownify/test.ts): New Turndown-based parseMarkdown(html, baseUrl) with custom rules and link cleaning; Playwright-backed convertPageToMarkdown(url) / convertPageToHTML(url) that sanitize the DOM before conversion; small integration test script.
  • Server run API (server/src/api/record.ts): Extended run handlers to accept requestedFormats (markdown/html), carry conversion outputs into serializableOutput, and include optional markdown/html in responses, socket events, and webhook payloads.
  • Worker & scheduler (server/src/pgboss-worker.ts, server/src/workflow-management/scheduler/index.ts): Added a scrape execution path for recording_meta.type === 'scrape': validate the URL, run conversions per format, populate serializableOutput/binaryOutput, update run status, emit socket events, send webhooks, capture metrics, and ensure cleanup on success and failure.
  • Storage routes (server/src/routes/storage.ts): PUT /recordings/:id updates recording_meta.url when targetUrl is provided and rewrites workflow goto targets; added a POST /recordings/scrape endpoint to create scrape-type robots with format validation (duplicate insertion visible in diff).
  • Models & types, server + client (server/src/models/Robot.ts, src/context/globalInfo.tsx, src/components/robot/pages/* — Edit/Duplicate/Settings): Added optional `type?: 'extract' | 'scrape'`, `url?: string`, and `formats?: ('markdown' | 'html')[]` fields to RobotMeta.
  • Frontend API (src/api/storage.ts): Added createScrapeRobot(url, name?, formats = ['markdown']), which POSTs to /storage/recordings/scrape.
  • Frontend UI (src/components/robot/RecordingsTable.tsx, src/components/robot/pages/RobotCreate.tsx, src/components/robot/pages/RobotIntegrationPage.tsx, src/components/run/RunContent.tsx): RobotCreate gains Extract / Scrape tabs with a scrape form and format selection wired to createScrapeRobot; RecordingsTable passes a robotType prop and hides Retrain/Duplicate for scrape robots; IntegrationPage hides Google Sheets/Airtable for scrape robots; RunContent renders and downloads Markdown/HTML outputs when present.
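Several of these files share the same format-normalization logic: filter the requested formats down to 'markdown' | 'html', falling back to the robot's stored formats (default ['markdown']). A minimal sketch of that logic as a pure helper; the fallback-on-empty behavior is an assumption, since the diffed code can leave the list empty when every requested format is invalid:

```typescript
// Hypothetical helper mirroring the format normalization in the scrape paths.
// Unknown values are dropped; an empty result falls back to the robot's
// stored formats (default ['markdown']).
type ScrapeFormat = 'markdown' | 'html';

function resolveFormats(
  stored: ScrapeFormat[] | undefined,
  requested?: unknown
): ScrapeFormat[] {
  const base: ScrapeFormat[] =
    stored && stored.length > 0 ? stored : ['markdown'];
  if (!Array.isArray(requested) || requested.length === 0) return base;
  const valid = requested.filter(
    (f): f is ScrapeFormat => f === 'markdown' || f === 'html'
  );
  return valid.length > 0 ? valid : base;
}
```

The explicit fallback also guards the `formats.join(', ')` log messages against an empty array.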

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FE as Frontend
    participant API as Backend (/storage)
    participant DB as Database

    User->>FE: Create scrape robot (url, name, formats)
    FE->>API: POST /recordings/scrape { url, name, formats }
    API->>API: Validate url & formats
    API->>DB: Insert Robot (meta.type = scrape, url, formats)
    DB-->>API: Created Robot
    API-->>FE: 201 Created (robot)
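The creation flow above hinges on the "Validate url & formats" step. A minimal sketch of that check, assuming the endpoint accepts only http(s) URLs and the two known formats (names and exact rules here are illustrative, not the PR's implementation):

```typescript
// Hypothetical request validation for POST /recordings/scrape.
// Returns an error message, or null when the request is acceptable.
function validateScrapeRequest(url: string, formats: string[]): string | null {
  let parsed: URL;
  try {
    parsed = new URL(url); // throws on malformed input
  } catch {
    return 'Invalid URL';
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return 'URL must use http or https';
  }
  if (!formats.some((f) => f === 'markdown' || f === 'html')) {
    return 'Select at least one output format';
  }
  return null;
}
```

Rejecting non-http(s) schemes up front also keeps Playwright from being pointed at file: or other local URLs.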
sequenceDiagram
    participant Client
    participant Scheduler as scheduler/index.ts
    participant Worker as pgboss-worker.ts
    participant Scrape as markdownify/scrape.ts
    participant Markdown as markdownify/markdown.ts
    participant Socket as WebSocket
    participant Webhook as Webhook
    rect rgba(200,230,255,0.5)
      Client->>Scheduler: POST /robots/:id/runs (requestedFormats?)
      Scheduler->>Worker: start run (includes formats)
      Worker->>Socket: emit run-started
    end
    rect rgba(220,255,200,0.5)
      Worker->>Scrape: convertPageToMarkdown/convertPageToHTML(url)
      Scrape->>Scrape: Playwright fetch + sanitize DOM
      Scrape->>Markdown: parseMarkdown(cleanedHtml, baseUrl)
      Markdown-->>Scrape: markdown text
      Scrape-->>Worker: conversion results (markdown/html)
    end
    rect rgba(255,230,200,0.5)
      Worker->>DB: update run (serializableOutput, status: success/failed)
      Worker->>Socket: emit run-completed
      Worker->>Webhook: send run_completed payload (with markdown/html if present)
      Worker-->>Client: run result
    end
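The conversion results in the middle section flow into serializableOutput keyed by format, each value an array of { content } entries (the shape visible in the diffed worker code). A minimal sketch of that packing step with stubbed converters, assuming exactly that shape:

```typescript
// Sketch of how the worker packs conversion results, per the diff:
// one key per requested format, each an array of { content } entries.
// The converter map stands in for convertPageToMarkdown/convertPageToHTML.
type ScrapeFormat = 'markdown' | 'html';
type SerializableOutput = Partial<Record<ScrapeFormat, { content: string }[]>>;

async function buildOutput(
  formats: ScrapeFormat[],
  convert: Record<ScrapeFormat, (url: string) => Promise<string>>,
  url: string
): Promise<SerializableOutput> {
  const out: SerializableOutput = {};
  for (const f of formats) {
    // Conversions run sequentially, matching the diffed code's ordering.
    out[f] = [{ content: await convert[f](url) }];
  }
  return out;
}
```

Keeping each format an array leaves room for multi-page scrapes later without changing the run serialization contract.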

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas to focus review on:
    • server/src/markdownify/markdown.ts — Turndown rules, link handling, URL cleaning, and error handling.
    • server/src/markdownify/scrape.ts — Playwright usage, DOM sanitization script, navigation timeouts, and browser closure on all paths.
    • server/src/pgboss-worker.ts and server/src/workflow-management/scheduler/index.ts — scrape branch integration with run lifecycle, socket/webhook payload composition, and correct serialization of outputs.
    • server/src/routes/storage.ts — validation, duplicated endpoint insertion in diff, and error handling for /recordings/scrape.
    • Frontend type changes (RobotMeta) and UI branching: ensure no regressions in existing robot flows.

Possibly related PRs

  • #485 — Overlaps with PUT /recordings/:id changes and goto/targetUrl handling in server/src/routes/storage.ts.
  • #610 — Related changes to run execution flow and webhook payloads; may overlap with worker/API webhook logic.
  • #754 — Overlaps on RobotMeta shape changes across client components (type/formats).

Suggested labels

Type: Feature, Type: Integration, Scope: UI/UX

Suggested reviewers

  • RohitR311

Poem

🐰 I hopped through tidy pages, nibbled scripts and tags away,
Playwright swept the windows, Turndown turned the mess to hay,
Scrape robots hum and gather markdown and HTML in a crate,
I stamp my paw and celebrate — outputs neat upon my plate! 🎋

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 19.05%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'feat: scrape [html + markdown]' clearly describes the main feature added in this PR: support for scraping web pages and converting them to HTML and Markdown.


@amhsirak amhsirak changed the title wip: markdownify wip: markdownify + plain text Nov 17, 2025
@amhsirak amhsirak changed the title wip: markdownify + plain text wip: markdownify Nov 19, 2025
@amhsirak amhsirak changed the title feat: markdownify feat: scrape [html + markdown] Nov 20, 2025
@amhsirak amhsirak marked this pull request as ready for review November 20, 2025 14:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/components/robot/pages/RobotEditPage.tsx (1)

723-734: Fix potential runtime error in getTargetUrl when what is missing

getTargetUrl currently does:

const lastPair =
  robot?.recording.workflow[robot?.recording.workflow.length - 1];
url = lastPair?.what.find((action) => action.action === "goto")?.args?.[0];

If lastPair exists but lastPair.what is undefined/null, lastPair?.what evaluates to undefined and the subsequent .find will throw. A safer version would guard both workflow length and what:

-  const lastPair =
-    robot?.recording.workflow[robot?.recording.workflow.length - 1];
-  url = lastPair?.what.find((action) => action.action === "goto")
-    ?.args?.[0];
+  const workflow = robot?.recording?.workflow;
+  const lastPair =
+    workflow && workflow.length > 0
+      ? workflow[workflow.length - 1]
+      : undefined;
+  const gotoAction = lastPair?.what?.find(
+    (action) => action.action === "goto"
+  );
+  url = gotoAction?.args?.[0];

Same pattern appears in RobotDuplicatePage.tsx; consider applying the same fix (or factoring this into a shared helper) to avoid similar crashes in both places.

Also applies to: 817-823

🧹 Nitpick comments (10)
src/components/run/RunContent.tsx (1)

727-735: Extract inline download logic to improve consistency.

The HTML download button contains inline download logic, while the Markdown download uses the extracted downloadMarkdown function. For consistency and maintainability, extract the HTML download logic to a separate function.

Add this function near the downloadMarkdown function (after line 413):

const downloadHTML = (content: string, filename: string) => {
  const blob = new Blob([content], { type: 'text/html;charset=utf-8;' });
  const url = URL.createObjectURL(blob);

  const link = document.createElement("a");
  link.href = url;
  link.setAttribute("download", filename);
  document.body.appendChild(link);
  link.click();
  document.body.removeChild(link);

  setTimeout(() => {
    URL.revokeObjectURL(url);
  }, 100);
};

Then update the button onClick:

                     <Box sx={{ mt: 2 }}>
                       <Button
-                        onClick={() => {
-                          const blob = new Blob([htmlContent], { type: 'text/html;charset=utf-8;' });
-                          const url = URL.createObjectURL(blob);
-                          const link = document.createElement("a");
-                          link.href = url;
-                          link.download = "output.html";
-                          link.click();
-                          setTimeout(() => URL.revokeObjectURL(url), 100);
-                        }}
+                        onClick={() => downloadHTML(htmlContent, 'output.html')}
                         sx={{ color: '#FF00C3', textTransform: 'none' }}
                       >
                         Download
                       </Button>
                     </Box>
src/components/robot/pages/RobotIntegrationPage.tsx (1)

734-781: Add user feedback for disabled integrations.

While the buttons are correctly disabled for scrape robots with appropriate visual feedback, users receive no explanation for why these integrations are unavailable. Consider adding a tooltip or informational message.

Add a tooltip to inform users:

               <Button
                 variant="outlined"
                 disabled={isScrapeRobot}
+                title={isScrapeRobot ? "Google Sheets integration is not available for scrape-type robots" : ""}
                 onClick={() => {
                   if (isScrapeRobot) return;

Alternatively, add an informational alert above the integration buttons when isScrapeRobot is true:

{isScrapeRobot && (
  <Alert severity="info" sx={{ mb: 2, maxWidth: "900px" }}>
    Google Sheets and Airtable integrations are not available for scrape-type robots. 
    Use Webhooks or MCP for scrape robot integrations.
  </Alert>
)}
src/components/robot/pages/RobotDuplicatePage.tsx (1)

19-30: RobotMeta type narrowing and formats look consistent; consider centralizing the type

The additions of type?: 'extract' | 'scrape' and formats?: ('markdown' | 'html')[] are aligned with the new scrape/markdown flow and are type-safe here. Since the same RobotMeta shape now exists in multiple pages (e.g., RobotEditPage.tsx), consider moving this interface to a shared types module to avoid drift the next time metadata fields change.

server/src/pgboss-worker.ts (1)

192-303: Avoid double failure handling for scrape runs in processRunExecution

The scrape branch already updates the run to failed, emits run-completed (failed) events, and destroys the browser inside its catch. Re‑throwing the error here causes the outer catch (executionError) to run as well, which:

  • updates the run status/log again,
  • emits another run-completed failure event,
  • and sends a separate run_failed webhook.

This duplication makes failure behavior harder to reason about and can spam downstream consumers.

Consider either:

  • Not rethrowing from the scrape catch (return { success: false } after local handling), or
  • Adding a flag to skip the generic failure path when the scrape branch has already fully handled the error.

That keeps scrape failures single-sourced and easier to maintain.

server/src/workflow-management/scheduler/index.ts (1)

328-358: Simplify scrape failure handling to avoid duplicate failure events

On scrape failure you:

  • update the run to failed,
  • emit run-completed (failed) over both namespaces,
  • destroy the browser,
  • and then throw error;.

The outer catch at the bottom of executeRun will then:

  • update the run to failed again,
  • send run_failed webhooks,
  • and emit another run-completed failure event.

This can cause double failure events for listeners (and makes it harder to reason about what fired when), even if the second emit happens after the namespace has been cleaned up.

A cleaner pattern would be to:

  • keep all scrape-specific failure handling (including webhooks if needed) inside this branch and return false instead of rethrowing, or
  • tag the error as “already handled” and branch accordingly in the outer catch.

That way each scheduled scrape run failure results in exactly one set of notifications.
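One way to implement the "tag the error as already handled" option is a marker error class (hypothetical, not part of the PR) that the outer catch can check before running the generic failure path:

```typescript
// Hypothetical marker for errors whose notifications (status update, socket
// event, webhook) were already sent by the scrape branch.
class HandledRunError extends Error {
  readonly handled = true;
  constructor(cause: unknown) {
    super(cause instanceof Error ? cause.message : String(cause));
    this.name = 'HandledRunError';
  }
}

// The outer catch would call this before updating the run or emitting events.
function shouldRunGenericFailurePath(err: unknown): boolean {
  return !(err instanceof HandledRunError);
}
```

The scrape branch would then rethrow `new HandledRunError(error)` after its local cleanup, so each failure produces exactly one set of notifications.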

src/components/robot/RecordingsTable.tsx (1)

776-783: Consider defensive check for undefined robotType.

The conditional checks robotType !== 'scrape' without verifying that robotType is defined. If row.type is undefined (e.g., for older robots created before this field was added), the menu items will still render, which might not be the intended behavior.

Consider adding a defensive check:

-        {robotType !== 'scrape' && (
+        {robotType && robotType !== 'scrape' && (
           <MenuItem onClick={() => { handleRetrain(); handleClose(); }}>

-        {robotType !== 'scrape' && (
+        {robotType && robotType !== 'scrape' && (
           <MenuItem onClick={() => { handleDuplicate(); handleClose(); }}>

Or provide a default value in the row mapping (lines 252-273) to ensure type is always defined.

Also applies to: 795-800

src/components/robot/pages/RobotCreate.tsx (2)

402-402: Accessibility: FormControl label should use FormLabel component.

The label "Output Format (Select at least one)" is rendered as a plain <p> tag instead of using the Material-UI FormLabel component, which would provide better accessibility and consistent styling.

Apply this diff:

                 <FormControl component="fieldset" sx={{ width: '100%', textAlign: 'left' }}>
-                  <p>Output Format (Select at least one)</p>
+                  <FormLabel component="legend">Output Format (Select at least one)</FormLabel>
                   <FormControlLabel

381-399: Consider adding URL validation before form submission.

While the onClick handler validates the URL at line 441-443, there's no real-time feedback when the user is typing. Consider adding URL validation on the TextField itself or showing a helper text to guide the user.

Example pattern:

const [urlError, setUrlError] = useState('');

const validateUrl = (value: string) => {
  try {
    new URL(value);
    setUrlError('');
  } catch {
    setUrlError('Please enter a valid URL');
  }
};

Then add to the TextField:

error={!!urlError}
helperText={urlError}
onBlur={(e) => validateUrl(e.target.value)}
server/src/api/record.ts (1)

692-695: Consider error handling for conversion failures.

The markdown and HTML conversions are called independently, but if one fails while the other succeeds, the entire run fails and neither result is saved. Consider handling partial success scenarios where at least one format conversion succeeds.

Consider this approach:

const serializableOutput: any = {};
const errors: string[] = [];

// Markdown conversion
if (formats.includes('markdown')) {
    try {
        markdown = await convertPageToMarkdown(url);
        serializableOutput.markdown = [{ content: markdown }];
    } catch (err: any) {
        errors.push(`Markdown conversion failed: ${err.message}`);
    }
}

// HTML conversion
if (formats.includes('html')) {
    try {
        html = await convertPageToHTML(url);
        serializableOutput.html = [{ content: html }];
    } catch (err: any) {
        errors.push(`HTML conversion failed: ${err.message}`);
    }
}

if (Object.keys(serializableOutput).length === 0) {
    throw new Error(`All conversions failed: ${errors.join('; ')}`);
}

Also applies to: 698-701

server/src/markdownify/scrape.ts (1)

12-12: Very long timeout without justification or fallback.

Both functions use a 100-second (100000ms) timeout for page load, which is unusually long. This could cause runs to hang for extended periods if a page is slow or unresponsive. Consider using a shorter timeout with proper error handling.

Apply this diff to use a more reasonable timeout:

-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+  await page.goto(url, { waitUntil: "networkidle", timeout: 30000 });

Or make it configurable:

const SCRAPE_TIMEOUT = process.env.SCRAPE_TIMEOUT_MS 
  ? parseInt(process.env.SCRAPE_TIMEOUT_MS) 
  : 30000;

await page.goto(url, { waitUntil: "networkidle", timeout: SCRAPE_TIMEOUT });

Also applies to: 67-67

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 334fbbc and 6477fee.

📒 Files selected for processing (18)
  • package.json (3 hunks)
  • server/src/api/record.ts (11 hunks)
  • server/src/markdownify/markdown.ts (1 hunks)
  • server/src/markdownify/scrape.ts (1 hunks)
  • server/src/markdownify/test.ts (1 hunks)
  • server/src/models/Robot.ts (1 hunks)
  • server/src/pgboss-worker.ts (2 hunks)
  • server/src/routes/storage.ts (4 hunks)
  • server/src/workflow-management/scheduler/index.ts (3 hunks)
  • src/api/storage.ts (1 hunks)
  • src/components/robot/RecordingsTable.tsx (3 hunks)
  • src/components/robot/pages/RobotCreate.tsx (2 hunks)
  • src/components/robot/pages/RobotDuplicatePage.tsx (1 hunks)
  • src/components/robot/pages/RobotEditPage.tsx (2 hunks)
  • src/components/robot/pages/RobotIntegrationPage.tsx (2 hunks)
  • src/components/robot/pages/RobotSettingsPage.tsx (1 hunks)
  • src/components/run/RunContent.tsx (5 hunks)
  • src/context/globalInfo.tsx (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
server/src/markdownify/markdown.ts (1)
maxun-core/src/browserSide/scraper.js (1)
  • depth (956-956)
server/src/markdownify/scrape.ts (1)
server/src/markdownify/markdown.ts (1)
  • parseMarkdown (1-98)
server/src/pgboss-worker.ts (4)
maxun-core/src/interpret.ts (1)
  • run (1620-1656)
server/src/markdownify/scrape.ts (2)
  • convertPageToMarkdown (8-57)
  • convertPageToHTML (63-111)
server/src/routes/webhook.ts (1)
  • sendWebhook (404-434)
server/src/browser-management/controller.ts (1)
  • destroyRemoteBrowser (96-139)
server/src/routes/storage.ts (1)
server/src/routes/record.ts (1)
  • AuthenticatedRequest (25-27)
src/components/robot/pages/RobotCreate.tsx (2)
src/context/globalInfo.tsx (1)
  • useGlobalInfoStore (189-189)
src/api/storage.ts (1)
  • createScrapeRobot (31-59)
server/src/markdownify/test.ts (1)
server/src/markdownify/scrape.ts (1)
  • convertPageToMarkdown (8-57)
server/src/workflow-management/scheduler/index.ts (3)
server/src/markdownify/scrape.ts (2)
  • convertPageToMarkdown (8-57)
  • convertPageToHTML (63-111)
server/src/routes/webhook.ts (1)
  • sendWebhook (404-434)
server/src/browser-management/controller.ts (1)
  • destroyRemoteBrowser (96-139)
server/src/api/record.ts (4)
server/src/markdownify/scrape.ts (2)
  • convertPageToMarkdown (8-57)
  • convertPageToHTML (63-111)
server/src/routes/webhook.ts (1)
  • sendWebhook (404-434)
server/src/browser-management/controller.ts (1)
  • destroyRemoteBrowser (96-139)
server/src/workflow-management/scheduler/index.ts (1)
  • handleRunRecording (597-622)
🔇 Additional comments (7)
src/components/robot/pages/RobotSettingsPage.tsx (1)

19-21: LGTM! Type extensions are clean and backward-compatible.

The optional fields added to RobotMeta (type, url, formats) align with the broader scrape functionality introduced in this PR and maintain backward compatibility.

src/context/globalInfo.tsx (1)

30-32: LGTM! RobotMeta extensions are consistent.

The optional fields added to RobotMeta align with the same changes made across multiple files (RobotSettingsPage.tsx, Robot.ts model). The consistency ensures type safety throughout the application.

server/src/models/Robot.ts (1)

12-14: LGTM! Server-side type extensions align with frontend.

The optional fields added to the RobotMeta interface on the server side correctly mirror the frontend type definitions, ensuring type consistency across the full stack.

src/components/robot/pages/RobotIntegrationPage.tsx (1)

131-132: Good scrape robot detection.

The isScrapeRobot flag cleanly derives from the robot metadata and is used consistently to disable incompatible integrations.

package.json (1)

49-49: All dependencies are current and secure.

The three dependencies specified—turndown (^7.2.2), joplin-turndown-plugin-gfm (^1.0.12), and @types/turndown (^5.0.6)—match their latest available versions and have no known security vulnerabilities according to GitHub's security advisory database.

src/components/robot/pages/RobotEditPage.tsx (1)

19-30: RobotMeta type/formats update is sound and consistent

The constrained type?: 'extract' | 'scrape' and added formats?: ('markdown' | 'html')[] align with the backend model and other UI code. No runtime impact here; just a safer, more explicit shape.

server/src/api/record.ts (1)

1047-1047: Review comment is incorrect: waitForRunCompletion still returns raw JSON, not a model instance.

The function uses raw: true in the Sequelize query at line 1043, which means it returns raw JSON data, not a Run model instance. The return value at line 1050 is this raw JSON object, which is correctly passed to formatRunResponse that expects plain object properties. There is no breaking change here.

Likely an incorrect or invalid review comment.

Comment on lines 665 to 807
if (recording.recording_meta.type === 'scrape') {
    logger.log('info', `Executing scrape robot for API run ${id}`);

    let formats = recording.recording_meta.formats || ['markdown'];

    // Override if API request defines formats
    if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
        formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
    }

    await run.update({
        status: 'running',
        log: `Converting page to: ${formats.join(', ')}`
    });

    try {
        const url = recording.recording_meta.url;

        if (!url) {
            throw new Error('No URL specified for markdown robot');
        }

        let markdown = '';
        let html = '';
        const serializableOutput: any = {};

        // Markdown conversion
        if (formats.includes('markdown')) {
            markdown = await convertPageToMarkdown(url);
            serializableOutput.markdown = [{ content: markdown }];
        }

        // HTML conversion
        if (formats.includes('html')) {
            html = await convertPageToHTML(url);
            serializableOutput.html = [{ content: html }];
        }

        await run.update({
            status: 'success',
            finishedAt: new Date().toLocaleString(),
            log: `${formats.join(', ')} conversion completed successfully`,
            serializableOutput,
            binaryOutput: {},
        });

        logger.log('info', `Markdown robot execution completed for API run ${id}`);

        // Push success socket event
        try {
            const completionData = {
                runId: plainRun.runId,
                robotMetaId: plainRun.robotMetaId,
                robotName: recording.recording_meta.name,
                status: 'success',
                finishedAt: new Date().toLocaleString()
            };

            serverIo
                .of('/queued-run')
                .to(`user-${userId}`)
                .emit('run-completed', completionData);
        } catch (socketError: any) {
            logger.log(
                'warn',
                `Failed to send run-completed notification for markdown robot run ${id}: ${socketError.message}`
            );
        }

        // Build webhook payload
        const webhookPayload: any = {
            robot_id: plainRun.robotMetaId,
            run_id: plainRun.runId,
            robot_name: recording.recording_meta.name,
            status: 'success',
            started_at: plainRun.startedAt,
            finished_at: new Date().toLocaleString(),
            metadata: {
                browser_id: plainRun.browserId,
                user_id: userId,
            },
        };

        if (formats.includes('markdown')) webhookPayload.markdown = markdown;
        if (formats.includes('html')) webhookPayload.html = html;

        try {
            await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
            logger.log(
                'info',
                `Webhooks sent successfully for markdown robot API run ${plainRun.runId}`
            );
        } catch (webhookError: any) {
            logger.log(
                'warn',
                `Failed to send webhooks for markdown robot run ${plainRun.runId}: ${webhookError.message}`
            );
        }

        await destroyRemoteBrowser(plainRun.browserId, userId);

        return {
            success: true,
            interpretationInfo: run.toJSON()
        };
    } catch (error: any) {
        logger.log(
            'error',
            `${formats.join(', ')} conversion failed for API run ${id}: ${error.message}`
        );

        await run.update({
            status: 'failed',
            finishedAt: new Date().toLocaleString(),
            log: `${formats.join(', ')} conversion failed: ${error.message}`,
        });

        // Send failure socket event
        try {
            const failureData = {
                runId: plainRun.runId,
                robotMetaId: plainRun.robotMetaId,
                robotName: recording.recording_meta.name,
                status: 'failed',
                finishedAt: new Date().toLocaleString()
            };

            serverIo
                .of('/queued-run')
                .to(`user-${userId}`)
                .emit('run-completed', failureData);
        } catch (socketError: any) {
            logger.log(
                'warn',
                `Failed to send run-failed notification for markdown robot run ${id}: ${socketError.message}`
            );
        }

        await destroyRemoteBrowser(plainRun.browserId, userId);

        throw error;
    }
}

⚠️ Potential issue | 🟡 Minor

Inconsistent naming: logs reference "markdown robot" but supports multiple formats.

Throughout the scrape execution block, logging messages refer to "markdown robot" even though the code correctly handles both markdown and html formats based on the formats array. This creates confusion about the actual functionality.

Update log messages to be format-agnostic:

-            logger.log('info', `Executing scrape robot for API run ${id}`);
+            logger.log('info', `Executing scrape robot for API run ${id} with formats: ${formats.join(', ')}`);

             try {
                 const url = recording.recording_meta.url;

                 if (!url) {
-                    throw new Error('No URL specified for markdown robot');
+                    throw new Error('No URL specified for scrape robot');
                 }

-                logger.log('info', `Markdown robot execution completed for API run ${id}`);
+                logger.log('info', `Scrape robot execution completed for API run ${id}`);

                     logger.log(
                         'warn',
-                        `Failed to send run-completed notification for markdown robot run ${id}: ${socketError.message}`
+                        `Failed to send run-completed notification for scrape robot run ${id}: ${socketError.message}`
                     );

                 try {
                     await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
                     logger.log(
                         'info',
-                        `Webhooks sent successfully for markdown robot API run ${plainRun.runId}`
+                        `Webhooks sent successfully for scrape robot API run ${plainRun.runId}`
                     );
                 } catch (webhookError: any) {
                     logger.log(
                         'warn',
-                        `Failed to send webhooks for markdown robot run ${plainRun.runId}: ${webhookError.message}`
+                        `Failed to send webhooks for scrape robot run ${plainRun.runId}: ${webhookError.message}`
                     );
                 }
📝 Committable suggestion


Suggested change
if (recording.recording_meta.type === 'scrape') {
logger.log('info', `Executing scrape robot for API run ${id}`);
let formats = recording.recording_meta.formats || ['markdown'];
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
}
await run.update({
status: 'running',
log: `Converting page to: ${formats.join(', ')}`
});
try {
const url = recording.recording_meta.url;
if (!url) {
throw new Error('No URL specified for markdown robot');
}
let markdown = '';
let html = '';
const serializableOutput: any = {};
// Markdown conversion
if (formats.includes('markdown')) {
markdown = await convertPageToMarkdown(url);
serializableOutput.markdown = [{ content: markdown }];
}
// HTML conversion
if (formats.includes('html')) {
html = await convertPageToHTML(url);
serializableOutput.html = [{ content: html }];
}
await run.update({
status: 'success',
finishedAt: new Date().toLocaleString(),
log: `${formats.join(', ')} conversion completed successfully`,
serializableOutput,
binaryOutput: {},
});
logger.log('info', `Markdown robot execution completed for API run ${id}`);
// Push success socket event
try {
const completionData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'success',
finishedAt: new Date().toLocaleString()
};
serverIo
.of('/queued-run')
.to(`user-${userId}`)
.emit('run-completed', completionData);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-completed notification for markdown robot run ${id}: ${socketError.message}`
);
}
// Build webhook payload
const webhookPayload: any = {
robot_id: plainRun.robotMetaId,
run_id: plainRun.runId,
robot_name: recording.recording_meta.name,
status: 'success',
started_at: plainRun.startedAt,
finished_at: new Date().toLocaleString(),
metadata: {
browser_id: plainRun.browserId,
user_id: userId,
},
};
if (formats.includes('markdown')) webhookPayload.markdown = markdown;
if (formats.includes('html')) webhookPayload.html = html;
try {
await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
logger.log(
'info',
`Webhooks sent successfully for markdown robot API run ${plainRun.runId}`
);
} catch (webhookError: any) {
logger.log(
'warn',
`Failed to send webhooks for markdown robot run ${plainRun.runId}: ${webhookError.message}`
);
}
await destroyRemoteBrowser(plainRun.browserId, userId);
return {
success: true,
interpretationInfo: run.toJSON()
};
} catch (error: any) {
logger.log(
'error',
`${formats.join(', ')} conversion failed for API run ${id}: ${error.message}`
);
await run.update({
status: 'failed',
finishedAt: new Date().toLocaleString(),
log: `${formats.join(', ')} conversion failed: ${error.message}`,
});
// Send failure socket event
try {
const failureData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'failed',
finishedAt: new Date().toLocaleString()
};
serverIo
.of('/queued-run')
.to(`user-${userId}`)
.emit('run-completed', failureData);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-failed notification for markdown robot run ${id}: ${socketError.message}`
);
}
await destroyRemoteBrowser(plainRun.browserId, userId);
throw error;
}
}
if (recording.recording_meta.type === 'scrape') {
let formats = recording.recording_meta.formats || ['markdown'];
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
}
logger.log('info', `Executing scrape robot for API run ${id} with formats: ${formats.join(', ')}`);
await run.update({
status: 'running',
log: `Converting page to: ${formats.join(', ')}`
});
try {
const url = recording.recording_meta.url;
if (!url) {
throw new Error('No URL specified for scrape robot');
}
let markdown = '';
let html = '';
const serializableOutput: any = {};
// Markdown conversion
if (formats.includes('markdown')) {
markdown = await convertPageToMarkdown(url);
serializableOutput.markdown = [{ content: markdown }];
}
// HTML conversion
if (formats.includes('html')) {
html = await convertPageToHTML(url);
serializableOutput.html = [{ content: html }];
}
await run.update({
status: 'success',
finishedAt: new Date().toLocaleString(),
log: `${formats.join(', ')} conversion completed successfully`,
serializableOutput,
binaryOutput: {},
});
logger.log('info', `Scrape robot execution completed for API run ${id}`);
// Push success socket event
try {
const completionData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'success',
finishedAt: new Date().toLocaleString()
};
serverIo
.of('/queued-run')
.to(`user-${userId}`)
.emit('run-completed', completionData);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-completed notification for scrape robot run ${id}: ${socketError.message}`
);
}
// Build webhook payload
const webhookPayload: any = {
robot_id: plainRun.robotMetaId,
run_id: plainRun.runId,
robot_name: recording.recording_meta.name,
status: 'success',
started_at: plainRun.startedAt,
finished_at: new Date().toLocaleString(),
metadata: {
browser_id: plainRun.browserId,
user_id: userId,
},
};
if (formats.includes('markdown')) webhookPayload.markdown = markdown;
if (formats.includes('html')) webhookPayload.html = html;
try {
await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
logger.log(
'info',
`Webhooks sent successfully for scrape robot API run ${plainRun.runId}`
);
} catch (webhookError: any) {
logger.log(
'warn',
`Failed to send webhooks for scrape robot run ${plainRun.runId}: ${webhookError.message}`
);
}
await destroyRemoteBrowser(plainRun.browserId, userId);
return {
success: true,
interpretationInfo: run.toJSON()
};
} catch (error: any) {
logger.log(
'error',
`${formats.join(', ')} conversion failed for API run ${id}: ${error.message}`
);
await run.update({
status: 'failed',
finishedAt: new Date().toLocaleString(),
log: `${formats.join(', ')} conversion failed: ${error.message}`,
});
// Send failure socket event
try {
const failureData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'failed',
finishedAt: new Date().toLocaleString()
};
serverIo
.of('/queued-run')
.to(`user-${userId}`)
.emit('run-completed', failureData);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-failed notification for scrape robot run ${id}: ${socketError.message}`
);
}
await destroyRemoteBrowser(plainRun.browserId, userId);
throw error;
}
}

Comment on lines +670 to +673
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
}

⚠️ Potential issue | 🟡 Minor

Missing validation: requestedFormats should be validated before use.

When requestedFormats is provided, line 672 filters it to only include valid formats, but there's no check to ensure the resulting array isn't empty. If a caller passes formats: ['invalid'], the filtered array would be empty and no output would be generated, but the run would still succeed.

Apply this diff to add validation:

             // Override if API request defines formats
             if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
                 formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
+                if (formats.length === 0) {
+                    throw new Error('No valid formats specified. Supported formats: markdown, html');
+                }
             }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
}
// Override if API request defines formats
if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
if (formats.length === 0) {
throw new Error('No valid formats specified. Supported formats: markdown, html');
}
}
🤖 Prompt for AI Agents
In server/src/api/record.ts around lines 670 to 673, the code filters
requestedFormats but does not validate the result; after filtering, check
whether the resulting formats array is non-empty and if it is empty return a 400
Bad Request (or throw a validation error) with a clear message like
"requestedFormats contains no supported formats; supported: markdown, html";
ensure you stop further processing when invalid so the run does not succeed with
no outputs.
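The suggested guard can also be factored into a small helper so the API route and workers validate identically. This is only a sketch under assumed names (`resolveFormats` does not exist in the codebase):

```typescript
type ScrapeFormat = "markdown" | "html";

const SUPPORTED_FORMATS: ScrapeFormat[] = ["markdown", "html"];

// Narrow an untrusted format list to the supported set; reject requests
// that leave no usable format instead of letting the run "succeed" empty.
export function resolveFormats(
  requested: unknown,
  fallback: ScrapeFormat[] = ["markdown"]
): ScrapeFormat[] {
  if (!Array.isArray(requested) || requested.length === 0) {
    return fallback;
  }
  const valid = requested.filter((f): f is ScrapeFormat =>
    SUPPORTED_FORMATS.includes(f as ScrapeFormat)
  );
  if (valid.length === 0) {
    throw new Error("No valid formats specified. Supported formats: markdown, html");
  }
  return valid;
}
```

With this shape, `formats: ['invalid']` rejects up front rather than producing an empty-but-successful run.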

Comment on lines +88 to +97
// Convert HTML → Markdown
try {
let out = await t.turndown(tidiedHtml);
out = fixBrokenLinks(out);
out = stripSkipLinks(out);
return out.trim();
} catch (err) {
console.error("HTML→Markdown failed", { err });
return "";
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Guard against tidyHtml returning null before passing to Turndown

tidyHtml currently does:

function tidyHtml(html: string): string {
  const $ = cheerio.load(html);
  // ...
  manuallyCleanedElements.forEach((tag) => $(tag).remove());
  return $("body").html();
}

But $("body").html() can be null, so tidiedHtml in parseMarkdown may be null even though the function is typed as returning string. That value is then passed into t.turndown(...), which can throw and push you into the catch path unnecessarily.

Two concrete improvements:

-function tidyHtml(html: string): string {
+function tidyHtml(html: string): string {
   const cheerio = require("cheerio");
   const $ = cheerio.load(html);
   // ...
-  manuallyCleanedElements.forEach((tag) => $(tag).remove());
-  return $("body").html();
+  manuallyCleanedElements.forEach((tag) => $(tag).remove());
+
+  const bodyHtml = $("body").html();
+  return bodyHtml || html;
}

and/or in parseMarkdown:

-  const tidiedHtml = tidyHtml(html);
+  const tidiedHtml = tidyHtml(html) || html;

This avoids spurious conversion failures and aligns the implementation with the declared return type.

Also applies to: 124-144


I'll verify the review comment by examining the actual code and confirming the Cheerio behavior.





🏁 Script executed:

#!/bin/bash
# Examine the markdown.ts file to see the parseMarkdown function and tidyHtml implementation
fd -e ts -n "markdown.ts" | head -5 | xargs -I {} cat "{}"

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Search for tidyHtml function definition to confirm the implementation
rg -n "function tidyHtml" --type ts -A 15

Length of output: 920


🏁 Script executed:

#!/bin/bash
# Search for parseMarkdown to see how tidiedHtml is used
rg -n "const tidiedHtml" --type ts -B 5 -A 10

Length of output: 1063


🏁 Script executed:

#!/bin/bash
# Check if cheerio is in package.json and what version
fd -e json "package.json" | head -1 | xargs -I {} grep -A 50 '"dependencies"' {} | grep -i cheerio

Length of output: 0


🌐 Web query:

Cheerio .html() method return type null or string

💡 Result:

Cheerio's .html() returns string | null — it returns the HTML of the first matched element, or null if nothing is matched. [1][2]

Example (JS/TS):
const $ = cheerio.load('');
const v = $('div').html(); // v is string | null — here null since no <div> exists

Sources: Cheerio API/docs. [1] [2]

References
[1] Cheerio — Classes: Cheerio.html() signature. [2] Cheerio — CheerioAPI / docs.



---

🏁 Script executed:

#!/bin/bash
# Get the complete tidyHtml function (should be around lines 124-144)
sed -n '124,150p' server/src/markdownify/markdown.ts

Length of output: 501


🏁 Script executed:

#!/bin/bash
# Also check the full parseMarkdown function around line 88-97
sed -n '1,100p' server/src/markdownify/markdown.ts | tail -30

Length of output: 600


Guard against tidyHtml returning null before passing to Turndown

tidyHtml at line 124 declares a return type of string, but currently returns $("body").html() directly. Cheerio's .html() method returns string | null, so if the body element is missing or empty, the function returns null despite the type annotation.

This null value is then passed to t.turndown(tidiedHtml) at line 91, which expects a string. While the try/catch will handle the resulting error, this creates unnecessary failures and obscures actual issues.

Two concrete improvements:

At line 124–144 in tidyHtml:

function tidyHtml(html: string): string {
  const cheerio = require("cheerio");
  const $ = cheerio.load(html);
  // ...
  manuallyCleanedElements.forEach((tag) => $(tag).remove());
  
  const bodyHtml = $("body").html();
  return bodyHtml || html;
}

Alternatively at line 12 in parseMarkdown:

const tidiedHtml = tidyHtml(html) || html;

This aligns the implementation with the declared return type and prevents unnecessary conversion failures.

🤖 Prompt for AI Agents
In server/src/markdownify/markdown.ts around lines 88–97 (and tidyHtml
implementation around lines 124–144), tidyHtml can return null because cheerio's
$("body").html() is string | null; update tidyHtml to coerce a non-null string
(e.g., const bodyHtml = $("body").html(); return bodyHtml || html;) or,
alternatively, change the caller at line ~88 to use const tidiedHtml =
tidyHtml(html) || html before passing to t.turndown; ensure the value passed to
turndown is always a string to prevent runtime errors.
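The guard can be illustrated without cheerio; `extractBody` below is a regex stand-in for `$("body").html()` that mirrors its `string | null` return (a sketch, not the repo's implementation — real cheerio also wraps fragments in an implicit body):

```typescript
// Stand-in for tidyHtml's final step: pull out <body> inner HTML, returning
// null when no body exists — the same `string | null` shape cheerio gives.
function extractBody(html: string): string | null {
  const match = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  return match ? match[1] : null;
}

// The guarded version the review recommends: never hand null to Turndown.
export function tidyHtmlSafe(html: string): string {
  return extractBody(html) ?? html;
}
```

The `?? html` fallback keeps the declared `string` return honest and avoids routing recoverable pages through the catch path.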

Comment on lines +8 to +57
export async function convertPageToMarkdown(url: string): Promise<string> {
const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

await page.addInitScript(() => {
const selectors = [
"script",
"style",
"link[rel='stylesheet']",
"noscript",
"meta",
"svg",
"img",
"picture",
"source",
"video",
"audio",
"iframe",
"object",
"embed"
];

selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(e => e.remove());
});

// Remove inline event handlers (onclick, onload…)
const all = document.querySelectorAll("*");
all.forEach(el => {
[...el.attributes].forEach(attr => {
if (attr.name.startsWith("on")) {
el.removeAttribute(attr.name);
}
});
});
});

// Re-extract HTML after cleanup
const cleanedHtml = await page.evaluate(() => {
return document.documentElement.outerHTML;
});

await browser.close();

// Convert cleaned HTML → Markdown
const markdown = await parseMarkdown(cleanedHtml, url);
return markdown;
}

⚠️ Potential issue | 🟠 Major

Missing resource cleanup: browser should be closed in a finally block.

If an error occurs during page navigation or evaluation, the browser instance won't be closed, leading to resource leaks. Both functions should wrap browser operations in try-finally blocks.

Apply this diff to both functions:

 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await chromium.launch();
-  const page = await browser.newPage();
+  try {
+    const page = await browser.newPage();

-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+    await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

-  await page.addInitScript(() => {
-    // ... cleanup script
-  });
+    // ... rest of the logic

-  const cleanedHtml = await page.evaluate(() => {
-    return document.documentElement.outerHTML;
-  });
+    const cleanedHtml = await page.evaluate(() => {
+      return document.documentElement.outerHTML;
+    });

-  await browser.close();
+    const markdown = await parseMarkdown(cleanedHtml, url);
+    return markdown;
+  } finally {
+    await browser.close();
+  }
+}

Also applies to: 63-111

🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 8-57 (and similarly at 63-111),
the browser is launched and operations are performed but browser.close() is only
called on the happy path; wrap the Playwright flow in a try/finally so the
browser is closed regardless of errors. Specifically, after launching the
browser (and creating the page), move the navigation/evaluation logic into a try
block and put await browser.close() in a finally block (guarding for a possibly
undefined browser), optionally closing the page if created; rethrow the error
after cleanup so behavior is preserved.
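The pattern generalizes to a small wrapper that owns the browser lifecycle, which both converters could share. `Browser` here is a minimal stand-in for Playwright's interface so the sketch stays self-contained; it is not the actual fix:

```typescript
// Guarantee close() runs on every path, success or throw.
interface Browser {
  close(): Promise<void>;
}

export async function withBrowser<T>(
  launch: () => Promise<Browser>,
  work: (browser: Browser) => Promise<T>
): Promise<T> {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    await browser.close();
  }
}
```

A converter then becomes `withBrowser(() => chromium.launch(), async (b) => { ... })`, and a navigation timeout can no longer leak a browser process.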

Comment on lines +14 to +45
await page.addInitScript(() => {
const selectors = [
"script",
"style",
"link[rel='stylesheet']",
"noscript",
"meta",
"svg",
"img",
"picture",
"source",
"video",
"audio",
"iframe",
"object",
"embed"
];

selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(e => e.remove());
});

// Remove inline event handlers (onclick, onload…)
const all = document.querySelectorAll("*");
all.forEach(el => {
[...el.attributes].forEach(attr => {
if (attr.name.startsWith("on")) {
el.removeAttribute(attr.name);
}
});
});
});

🛠️ Refactor suggestion | 🟠 Major

Significant code duplication: cleanup logic is identical in both functions.

The element removal and event handler cleanup code (lines 14-45 and 69-100) is duplicated verbatim between convertPageToMarkdown and convertPageToHTML. This violates the DRY principle and makes maintenance harder.

Extract the cleanup logic into a shared helper function:

async function cleanPage(page: Page): Promise<string> {
  await page.addInitScript(() => {
    const selectors = [
      "script",
      "style",
      "link[rel='stylesheet']",
      "noscript",
      "meta",
      "svg",
      "img",
      "picture",
      "source",
      "video",
      "audio",
      "iframe",
      "object",
      "embed"
    ];

    selectors.forEach(sel => {
      document.querySelectorAll(sel).forEach(e => e.remove());
    });

    // Remove inline event handlers
    const all = document.querySelectorAll("*");
    all.forEach(el => {
      [...el.attributes].forEach(attr => {
        if (attr.name.startsWith("on")) {
          el.removeAttribute(attr.name);
        }
      });
    });
  });

  return await page.evaluate(() => {
    return document.documentElement.outerHTML;
  });
}

export async function convertPageToMarkdown(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
  const cleanedHtml = await cleanPage(page);
  await browser.close();

  const markdown = await parseMarkdown(cleanedHtml, url);
  return markdown;
}

export async function convertPageToHTML(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
  const cleanedHtml = await cleanPage(page);
  await browser.close();

  return cleanedHtml;
}

Also applies to: 69-100

🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 14-45 and 69-100, the DOM
cleanup logic is duplicated; extract it into a single async helper (e.g.,
cleanPage(page: Page): Promise<string>) that calls page.addInitScript with the
selector removal and inline event-attribute cleanup, then returns await
page.evaluate(() => document.documentElement.outerHTML); replace the duplicated
blocks in convertPageToMarkdown and convertPageToHTML with a call to this
helper, keep launching/closing the browser and page handling the same, and add
any necessary imports/types for Playwright's Page.

⚠️ Potential issue | 🔴 Critical

Critical issue: addInitScript runs AFTER initial page load, so cleanup doesn't affect the fetched content.

The addInitScript is called after page.goto, which means it only affects subsequent navigations, not the current page that was just loaded. The cleanup script never runs on the initial page load, so all scripts, styles, images, etc. are still present in the extracted HTML.

Move the cleanup to execute on the loaded page using evaluate or call addInitScript before goto:

 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await chromium.launch();
   const page = await browser.newPage();

+  await page.addInitScript(() => {
+    // Cleanup script here...
+  });
+
   await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

-  await page.addInitScript(() => {
-    // Cleanup script here...
-  });
-
-  // Re-extract HTML after cleanup
-  const cleanedHtml = await page.evaluate(() => {
+  const cleanedHtml = await page.evaluate(() => {
+    // Run cleanup directly in evaluate
+    const selectors = ["script", "style", ...];
+    selectors.forEach(sel => {
+      document.querySelectorAll(sel).forEach(e => e.remove());
+    });
+    
+    const all = document.querySelectorAll("*");
+    all.forEach(el => {
+      [...el.attributes].forEach(attr => {
+        if (attr.name.startsWith("on")) {
+          el.removeAttribute(attr.name);
+        }
+      });
+    });
+    
     return document.documentElement.outerHTML;
   });

Also applies to: 69-100

🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 14 to 45 (and similarly for
lines ~69-100), the cleanup script is being registered with page.addInitScript
after page.goto so it doesn't run on the initial navigation; move the cleanup to
run on the loaded page by either registering the addInitScript before calling
page.goto OR running the same DOM-removal logic immediately after navigation
using page.evaluate; update both places to ensure the selectors and inline-event
removal execute on the current document (call addInitScript before any
navigation or replace the post-load addInitScript with page.evaluate after await
page.goto).
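The timing difference is easy to demonstrate with a mock of Playwright's init-script semantics (a simplified model, not Playwright itself):

```typescript
// Simplified model: scripts registered via addInitScript run at the start
// of each *subsequent* navigation, so registering one after goto() leaves
// the already-loaded page untouched.
class MockPage {
  private initScripts: Array<() => void> = [];
  public log: string[] = [];

  async addInitScript(fn: () => void): Promise<void> {
    this.initScripts.push(fn);
  }

  async goto(url: string): Promise<void> {
    this.initScripts.forEach((fn) => fn()); // init scripts fire on navigation
    this.log.push(`loaded:${url}`);
  }

  async evaluate<T>(fn: () => T): Promise<T> {
    return fn(); // runs against the current page, whatever its state
  }
}
```

Under this model, a cleanup script registered after `goto` never touches the page that was just fetched — which is why moving the removal logic into `evaluate` (or registering the init script before navigating) is required.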

Comment on lines 211 to 359
if (recording.recording_meta.type === 'scrape') {
logger.log('info', `Executing scrape robot for scheduled run ${id}`);

const formats = recording.recording_meta.formats || ['markdown'];

await run.update({
status: 'running',
log: `Converting page to: ${formats.join(', ')}`
});

try {
const runStartedData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'running',
startedAt: plainRun.startedAt
};

serverIo.of('/queued-run').to(`user-${userId}`).emit('run-started', runStartedData);
logger.log(
'info',
`Scrape robot run started notification sent for run: ${plainRun.runId} to user-${userId}`
);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-started notification for scrape robot run ${plainRun.runId}: ${socketError.message}`
);
}

try {
const url = recording.recording_meta.url;

if (!url) {
throw new Error('No URL specified for scrape robot');
}

let markdown = '';
let html = '';
const serializableOutput: any = {};

// Markdown conversion
if (formats.includes('markdown')) {
markdown = await convertPageToMarkdown(url);
serializableOutput.markdown = [{ content: markdown }];
}

// HTML conversion
if (formats.includes('html')) {
html = await convertPageToHTML(url);
serializableOutput.html = [{ content: html }];
}

await run.update({
status: 'success',
finishedAt: new Date().toLocaleString(),
log: `${formats.join(', ')} conversion completed successfully`,
serializableOutput,
binaryOutput: {},
});

logger.log('info', `Scrape robot execution completed for scheduled run ${id}`);

// Run-completed socket notifications
try {
const completionData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'success',
finishedAt: new Date().toLocaleString()
};

serverIo.of(plainRun.browserId).emit('run-completed', completionData);
serverIo.of('/queued-run').to(`user-${userId}`).emit('run-completed', completionData);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-completed notification for scrape robot run ${id}: ${socketError.message}`
);
}

// Webhook payload
const webhookPayload: any = {
robot_id: plainRun.robotMetaId,
run_id: plainRun.runId,
robot_name: recording.recording_meta.name,
status: 'success',
started_at: plainRun.startedAt,
finished_at: new Date().toLocaleString(),
metadata: {
browser_id: plainRun.browserId,
user_id: userId,
}
};

if (formats.includes('markdown')) webhookPayload.markdown = markdown;
if (formats.includes('html')) webhookPayload.html = html;

try {
await sendWebhook(plainRun.robotMetaId, 'run_completed', webhookPayload);
logger.log(
'info',
`Webhooks sent successfully for scrape robot scheduled run ${plainRun.runId}`
);
} catch (webhookError: any) {
logger.log(
'warn',
`Failed to send webhooks for scrape robot run ${plainRun.runId}: ${webhookError.message}`
);
}

await destroyRemoteBrowser(plainRun.browserId, userId);

return true;

} catch (error: any) {
logger.log('error', `${formats.join(', ')} conversion failed for scheduled run ${id}: ${error.message}`);

await run.update({
status: 'failed',
finishedAt: new Date().toLocaleString(),
log: `${formats.join(', ')} conversion failed: ${error.message}`,
});

try {
const failureData = {
runId: plainRun.runId,
robotMetaId: plainRun.robotMetaId,
robotName: recording.recording_meta.name,
status: 'failed',
finishedAt: new Date().toLocaleString()
};

serverIo.of(plainRun.browserId).emit('run-completed', failureData);
serverIo.of('/queued-run').to(`user-${userId}`).emit('run-completed', failureData);
} catch (socketError: any) {
logger.log(
'warn',
`Failed to send run-failed notification for scrape robot run ${id}: ${socketError.message}`
);
}

await destroyRemoteBrowser(plainRun.browserId, userId);

throw error;
}
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Scrape path bypasses proxy/remote browser config and duplicates logic

The new scrape branch works functionally, but a few aspects are worth tightening:

  1. Proxy and browser config are ignored

    Earlier in this file you resolve proxy settings and create a remote browser for the run, but convertPageToMarkdown/HTML each launch their own Playwright browser with default options. That means:

    • user-specific proxy settings are not applied for scrape runs;
    • stealth/plugins or other browser configuration used for normal runs are bypassed.

    For users who rely on proxies to reach target sites, scrape runs will likely fail even though extract-type robots would succeed.

    Consider either:

    • passing proxy options into convertPageToMarkdown/HTML, or
    • refactoring them to operate on the already-initialized remote browser/page rather than spinning up a separate browser.
  2. Logic is duplicated with server/src/pgboss-worker.ts

    The scrape success path (formats handling, serializableOutput shape, socket notifications, webhook payload structure) is almost identical to the one added in pgboss-worker.ts. Extracting a shared helper (e.g., runScrapeConversion(recording, formats, run, userId, browserId, context)) would keep behavior consistent and make future changes (like the proxy handling above) less error-prone.


I'll verify the key claims in the review comment by examining the implementation and comparing with pgboss-worker.ts.


---

🏁 Script executed:

#!/bin/bash
# Find implementations of convertPageToMarkdown and convertPageToHTML
rg -n "convertPageToMarkdown|convertPageToHTML" -A 10 -B 2 --type=ts

Length of output: 9526


🏁 Script executed:

#!/bin/bash
# Check how proxy options are set up earlier in scheduler/index.ts
rg -n "proxy" server/src/workflow-management/scheduler/index.ts -B 3 -A 3 --type=ts

Length of output: 850


🏁 Script executed:

#!/bin/bash
# Extract the scrape success path logic from scheduler/index.ts and compare with pgboss-worker
sed -n '211,359p' server/src/workflow-management/scheduler/index.ts > /tmp/scheduler_scrape.ts
cat /tmp/scheduler_scrape.ts | head -150

Length of output: 4956


🏁 Script executed:

#!/bin/bash
# Look for similar scrape handling in pgboss-worker.ts
rg -n "convertPageToMarkdown|convertPageToHTML|scrape" server/src/pgboss-worker.ts -B 2 -A 15 --type=ts | head -100

Length of output: 3827


I need to verify one more claim about the failure handling doubling. Let me check the outer catch block.


---

🏁 Script executed:

#!/bin/bash
# Look for executeRun function and its outer catch block to verify double failure handling
rg -n "executeRun|catch.*error" server/src/workflow-management/scheduler/index.ts -B 2 -A 5 | tail -80

Length of output: 2075


🏁 Script executed:

#!/bin/bash
# Check the full function structure around executeRun to understand error propagation
sed -n '130,220p' server/src/workflow-management/scheduler/index.ts

Length of output: 2960


Scrape path bypasses proxy/remote browser config and duplicates logic

The new scrape branch works functionally, but has three real concerns worth addressing:

  1. Proxy and browser config are ignored

    The proxy config is retrieved and configured earlier in this file (lines 37-47), but convertPageToMarkdown and convertPageToHTML launch their own Playwright browser instances with default options. This means:

    • user-specific proxy settings are not applied for scrape runs
    • stealth plugins or other browser configuration used for normal runs are bypassed

    Users who rely on proxies to reach target sites will see scrape runs fail where extract-type robots would succeed.

    Consider passing proxyOptions into these functions or refactoring them to operate on the already-initialized remote browser instead.

  2. Logic is duplicated across multiple files

    The scrape success path (formats handling, serializableOutput shape, socket notifications, webhook payload structure) is nearly identical in:

    • server/src/workflow-management/scheduler/index.ts (lines 254–270)
    • server/src/pgboss-worker.ts (lines 214–232)
    • server/src/api/record.ts (lines 691–709)

    Extracting a shared helper would keep behavior consistent and reduce maintenance risk.

  3. Double failure status updates on error

    When conversion fails, the inner catch (line 328) updates run.status to 'failed' and throws, which is then caught by the outer handler (line 508), updating the status again. This causes duplicate notifications and database writes.

🤖 Prompt for AI Agents
In server/src/workflow-management/scheduler/index.ts around lines 211–359 the
new scrape branch bypasses user proxy/remote-browser config, duplicates
result/notification/webhook logic found in other files, and performs a redundant
run.status update on conversion error; fix by 1) changing the markdown/html
conversion calls to accept and use the existing proxy/browser options (or accept
a browser instance) so they reuse the remote browser/proxy/stealth config
initialized earlier (pass proxyOptions or browser handle into
convertPageToMarkdown/convertPageToHTML), 2) extract the shared success/failure
handling (building serializableOutput, socket notifications, webhook payload and
run.update calls) into a single helper used by scheduler, pgBoss worker and
api/record to avoid duplication, and 3) remove the inner catch’s run.update that
sets status to 'failed' (or convert it to only prepare failure data without
persisting) so that the outer error handler is the single place that updates run
status and emits failure notifications to prevent double updates.
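A possible shape for that shared helper, with the converters injected so proxy-aware implementations can be passed in later. Names and signatures are assumptions, not the repo's actual API:

```typescript
type ScrapeFormat = "markdown" | "html";

interface Converters {
  toMarkdown(url: string): Promise<string>;
  toHtml(url: string): Promise<string>;
}

// One conversion step shared by the API route, pg-boss worker, and
// scheduler, producing the serializableOutput shape all three persist.
export async function runScrapeConversion(
  url: string | undefined,
  formats: ScrapeFormat[],
  converters: Converters
): Promise<Record<string, { content: string }[]>> {
  if (!url) {
    throw new Error("No URL specified for scrape robot");
  }
  const serializableOutput: Record<string, { content: string }[]> = {};
  if (formats.includes("markdown")) {
    serializableOutput.markdown = [{ content: await converters.toMarkdown(url) }];
  }
  if (formats.includes("html")) {
    serializableOutput.html = [{ content: await converters.toHtml(url) }];
  }
  return serializableOutput;
}
```

Each call site would then only own its run-status updates, socket events, and webhooks, and a proxy-aware converter pair can be substituted without touching three files.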

Comment on lines +31 to +59
export const createScrapeRobot = async (
url: string,
name?: string,
formats: string[] = ['markdown']
): Promise<any> => {
try {
const response = await axios.post(
`${apiUrl}/storage/recordings/scrape`,
{
url,
name,
formats,
},
{
headers: { 'Content-Type': 'application/json' },
withCredentials: true,
}
);

if (response.status === 201) {
return response.data;
} else {
throw new Error('Failed to create markdown robot');
}
} catch (error: any) {
console.error('Error creating markdown robot:', error);
return null;
}
};

⚠️ Potential issue | 🟡 Minor

Fix error message and consider error handling consistency.

The error message on line 53 references "markdown robot" but should reference "scrape robot" for consistency with the function name and broader terminology.

Apply this diff:

     } else {
-      throw new Error('Failed to create markdown robot');
+      throw new Error('Failed to create scrape robot');
     }
   } catch (error: any) {
-    console.error('Error creating markdown robot:', error);
+    console.error('Error creating scrape robot:', error);
     return null;
   }

Additionally, consider whether returning null on error is the best approach. Other functions in this file (e.g., getStoredRecordings, getStoredRecording) also return null on error, which is consistent, but this pattern forces callers to check for null. Consider whether throwing errors would provide better error propagation.

🤖 Prompt for AI Agents
In src/api/storage.ts around lines 31 to 59, the error messages refer to
"markdown robot" but the function creates a "scrape robot" — update the thrown
Error and the console.error message to reference "scrape robot" instead;
additionally decide on consistent error handling: either keep returning null on
error to match getStoredRecordings/getStoredRecording (no behavioral change), or
switch to rethrowing the caught error (or throw a new Error with context) so
callers can handle exceptions — implement one approach consistently and update
the function signature/JSdoc/comments accordingly.
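If the throwing approach is chosen, the function might look like this. The HTTP client is injected to keep the sketch self-contained; the endpoint and response shape mirror the snippet above but remain assumptions about the real API:

```typescript
// Throwing variant of createScrapeRobot: surface the cause to the caller
// instead of collapsing every failure into null.
export async function createScrapeRobot(
  post: (path: string, body: unknown) => Promise<{ status: number; data: unknown }>,
  url: string,
  name?: string,
  formats: string[] = ["markdown"]
): Promise<unknown> {
  const response = await post("/storage/recordings/scrape", { url, name, formats });
  if (response.status !== 201) {
    throw new Error(`Failed to create scrape robot (status ${response.status})`);
  }
  return response.data;
}
```

Callers such as the RobotCreate form can then show the real failure reason in `notify('error', ...)` rather than a generic message.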

notify('success', `${scrapeRobotName} created successfully!`);
navigate('/robots');
} else {
notify('error', 'Failed to create markdown robot');

⚠️ Potential issue | 🟡 Minor

Inconsistent error message: references "markdown robot" instead of "scrape robot".

The error message at line 463 says "Failed to create markdown robot" but the Scrape tab supports both markdown and HTML formats. The message should be more generic.

Apply this diff:

-                    notify('error', 'Failed to create markdown robot');
+                    notify('error', 'Failed to create scrape robot');
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-                    notify('error', 'Failed to create markdown robot');
+                    notify('error', 'Failed to create scrape robot');
🤖 Prompt for AI Agents
In src/components/robot/pages/RobotCreate.tsx around line 463, the notify error
message incorrectly says "Failed to create markdown robot"; update this to a
generic message that covers both markdown and HTML (e.g., "Failed to create
scrape robot" or "Failed to create robot") so the Scrape tab error is accurate;
replace the string in the notify('error', ...) call and run/adjust any tests or
snapshots that assert the old message.

            <ListItemIcon>
              <Refresh fontSize="small" />
            </ListItemIcon>
            <ListItemText>Retrain</ListItemText>

🛠️ Refactor suggestion | 🟠 Major

Use translation keys instead of hardcoded English strings.

The menu item labels were changed from translated strings (using t(...)) to hardcoded English text. This breaks internationalization support that exists throughout the rest of the component.

Apply this diff to restore i18n support:

-            <ListItemText>Retrain</ListItemText>
+            <ListItemText>{t('recordingtable.options.retrain')}</ListItemText>

         <MenuItem onClick={() => { handleEdit(); handleClose(); }}>
           <ListItemIcon><Edit fontSize="small" /></ListItemIcon>
-          <ListItemText>Edit</ListItemText>
+          <ListItemText>{t('recordingtable.options.edit')}</ListItemText>
         </MenuItem>

         <MenuItem onClick={() => { handleDelete(); handleClose(); }}>
           <ListItemIcon><DeleteForever fontSize="small" /></ListItemIcon>
-          <ListItemText>Delete</ListItemText>
+          <ListItemText>{t('recordingtable.options.delete')}</ListItemText>
         </MenuItem>

         {robotType !== 'scrape' && (
           <MenuItem onClick={() => { handleDuplicate(); handleClose(); }}>
             <ListItemIcon><ContentCopy fontSize="small" /></ListItemIcon>
-            <ListItemText>Duplicate</ListItemText>
+            <ListItemText>{t('recordingtable.options.duplicate')}</ListItemText>
           </MenuItem>
         )}

Also applies to: 787-787, 792-792, 798-798

🤖 Prompt for AI Agents
In src/components/robot/RecordingsTable.tsx around lines 781 (and similarly at
lines 787, 792, 798), several MenuItem labels were changed to hardcoded English
strings like "Retrain", breaking i18n; replace those hardcoded strings with the
translation function calls (e.g. t('your.translation.key') for each label),
using the existing translation keys used elsewhere in the component (or add
appropriate new keys) so each MenuItemText uses t(...) instead of a literal
string, and ensure the component imports/has access to the t function from
react-i18next as used elsewhere in this file.
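The value of the `t(...)` calls the review restores can be shown with a minimal lookup sketch: the same key resolves per locale, which is exactly what hardcoded strings break. The resource table and the German strings below are invented for illustration; the keys mirror the ones suggested in the diff above:

```typescript
// Hypothetical resource bundles; in the real app these live in the i18n config.
const resources: Record<string, Record<string, string>> = {
  en: { 'recordingtable.options.retrain': 'Retrain' },
  de: { 'recordingtable.options.retrain': 'Neu trainieren' },
};

// Simplified stand-in for react-i18next's t(): look up the key for a locale,
// falling back to the key itself when no translation exists.
function makeT(locale: string) {
  return (key: string): string => resources[locale]?.[key] ?? key;
}

const t = makeT('de');
t('recordingtable.options.retrain'); // resolves to the German label
```

With `<ListItemText>{t('recordingtable.options.retrain')}</ListItemText>`, switching the locale switches the label; with a literal `Retrain`, it cannot.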

Comment on lines +68 to +86
  useEffect(() => {
    setMarkdownContent('');
    setHtmlContent('');

    if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown)) {
      const markdownData = row.serializableOutput.markdown[0];
      if (markdownData?.content) {
        setMarkdownContent(markdownData.content);
      }
    }

    if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html)) {
      const htmlData = row.serializableOutput.html[0];
      if (htmlData?.content) {
        setHtmlContent(htmlData.content);
      }
    }
  }, [row.serializableOutput]);


🛠️ Refactor suggestion | 🟠 Major

Add safety checks for array access.

The code assumes the markdown and html arrays have at least one element. If an array is empty, `[0]` yields undefined, and only the optional chaining on the following line keeps the `.content` read from throwing; an explicit length check makes the guard intentional rather than incidental.

Apply this diff to add safety checks:

   useEffect(() => {
     setMarkdownContent('');
     setHtmlContent('');
 
-    if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown)) {
+    if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown) && row.serializableOutput.markdown.length > 0) {
       const markdownData = row.serializableOutput.markdown[0];
       if (markdownData?.content) {
         setMarkdownContent(markdownData.content);
       }
     }
 
-    if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html)) {
+    if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html) && row.serializableOutput.html.length > 0) {
       const htmlData = row.serializableOutput.html[0];
       if (htmlData?.content) {
         setHtmlContent(htmlData.content);
       }
     }
   }, [row.serializableOutput]);
📝 Committable suggestion


Suggested change

  useEffect(() => {
    setMarkdownContent('');
    setHtmlContent('');

-    if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown)) {
+    if (row.serializableOutput?.markdown && Array.isArray(row.serializableOutput.markdown) && row.serializableOutput.markdown.length > 0) {
      const markdownData = row.serializableOutput.markdown[0];
      if (markdownData?.content) {
        setMarkdownContent(markdownData.content);
      }
    }

-    if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html)) {
+    if (row.serializableOutput?.html && Array.isArray(row.serializableOutput.html) && row.serializableOutput.html.length > 0) {
      const htmlData = row.serializableOutput.html[0];
      if (htmlData?.content) {
        setHtmlContent(htmlData.content);
      }
    }
  }, [row.serializableOutput]);
🤖 Prompt for AI Agents
In src/components/run/RunContent.tsx around lines 68 to 86 the effect assumes
row.serializableOutput.markdown and .html arrays contain at least one element
and accesses [0] directly; update both branches to first check that the array
exists and has length > 0 (or use .find to locate the first element with a
content property) before reading [0] or .content, and only call
setMarkdownContent/setHtmlContent when a valid item with .content is found to
avoid undefined access when arrays are empty.
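The `.find`-based variant the prompt mentions can be factored into a small helper; `OutputItem` is an assumed shape standing in for whatever type `serializableOutput` entries actually have:

```typescript
type OutputItem = { content?: string };

// Return the content of the first item that actually has a string `content`,
// or '' when the input is not an array, is empty, or has no such item.
function firstContent(items: unknown): string {
  if (!Array.isArray(items) || items.length === 0) return '';
  const hit = (items as OutputItem[]).find((i) => typeof i?.content === 'string');
  return hit?.content ?? '';
}

firstContent(undefined);                  // '' — no array at all
firstContent([]);                         // '' — empty array, no [0] access
firstContent([{}, { content: 'md' }]);    // 'md' — skips items without content
```

Inside the effect this collapses both branches to `setMarkdownContent(firstContent(row.serializableOutput?.markdown))` and the html equivalent.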

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/components/robot/pages/RobotCreate.tsx (1)

464-464: Inconsistent error message: references "markdown robot" instead of "scrape robot".

The error message says "Failed to create markdown robot" but the Scrape tab supports both markdown and HTML formats. The message should be generic.

Apply this diff:

-                    notify('error', 'Failed to create markdown robot');
+                    notify('error', 'Failed to create scrape robot');
🧹 Nitpick comments (2)
src/components/robot/pages/RobotCreate.tsx (2)

403-403: Consider using FormLabel for consistency.

The label uses a plain <p> tag instead of Material-UI's FormLabel component, which would be more consistent with the FormControl usage.

Apply this diff:

-                <FormControl component="fieldset" sx={{ width: '100%', textAlign: 'left' }}>
-                  <p>Output Format (Select at least one)</p>
+                <FormControl component="fieldset" sx={{ width: '100%', textAlign: 'left' }}>
+                  <FormLabel component="legend">Output Format (Select at least one)</FormLabel>

264-271: Consider extracting duplicated button styles.

The button styles are duplicated between the Extract and Scrape tabs. Consider extracting to a shared constant for maintainability.

Example refactor:

const actionButtonStyles = {
  bgcolor: '#ff00c3',
  py: 1.4,
  fontSize: '1rem',
  textTransform: 'none',
  maxWidth: 700,
  borderRadius: 2
};

// Then use: sx={actionButtonStyles}

Also applies to: 468-475

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 467ffe3 and a1b2117.

📒 Files selected for processing (1)
  • src/components/robot/pages/RobotCreate.tsx (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/components/robot/pages/RobotCreate.tsx (2)
src/context/globalInfo.tsx (1)
  • useGlobalInfoStore (189-189)
src/api/storage.ts (1)
  • createScrapeRobot (31-59)
🔇 Additional comments (1)
src/components/robot/pages/RobotCreate.tsx (1)

442-445: Backend validates URL format properly—no action needed.

The frontend validation (line 442-445) only checks for non-empty input, but the backend endpoint at server/src/routes/storage.ts:456-459 validates URL format using the new URL() constructor, which properly parses and rejects invalid URLs. This backend validation is called before any processing occurs, adequately addressing malformed or invalid URL inputs.
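The validation pattern described above, using the `new URL()` constructor as the real check, can be sketched as follows; the helper name and the http/https restriction are illustrative additions, not the exact code in server/src/routes/storage.ts:

```typescript
// new URL() throws a TypeError on malformed input, so a try/catch gives a far
// stricter check than the frontend's non-empty test.
function isValidHttpUrl(input: string): boolean {
  try {
    const url = new URL(input);
    // Also reject well-formed URLs with schemes a scrape robot cannot fetch.
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

isValidHttpUrl('https://example.com/page'); // true
isValidHttpUrl('not a url');                // false — constructor throws
isValidHttpUrl('ftp://example.com');        // false — parses, but wrong scheme
```

Since the backend runs this style of check before any processing, the lenient frontend check only affects how early the user sees the error, not what the server accepts.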

@amhsirak amhsirak merged commit a1515c2 into develop Nov 20, 2025
1 check passed