Skip to content

Conversation

@RohitR311
Copy link
Contributor

@RohitR311 RohitR311 commented May 6, 2025

What this PR does?

Improves performance of the existing output preview functionality to get faster data previews.

Closes: #567

Summary by CodeRabbit

  • New Features
    • Added the ability to update the limit for list steps within browser steps via the user interface.
    • Introduced enhanced data extraction for list elements, enabling detailed capture of structured list data with field-specific attributes.
    • Enabled real-time synchronization of extracted list data between client and server for improved list step management.
    • Improved capture logs with dynamic tab selection and structured display of captured list data for clearer user insights.

@RohitR311 RohitR311 added Type: Enhancement Improvements to existing features Status: Work In Progess This issue/PR is actively being worked on labels May 6, 2025
@coderabbitai
Copy link

coderabbitai bot commented May 6, 2025

"""

Walkthrough

The changes export the BrowserStep type, add new methods updateListStepLimit and updateListStepData to the BrowserStepsContextType interface, and implement these methods in the BrowserStepsProvider component to update list step properties. The RightSidePanel component uses updateListStepLimit to update the limit of the latest list step during capture. On the server, a private method extractListData was added to the RemoteBrowser class to extract structured list data from a page, with a socket event listener to trigger extraction and emit results. The BrowserWindow component listens for extracted data events to update list step data and emits extraction requests when fields change. The InterpretationLog component now tracks active action IDs and updates capture data based on browser steps, replacing previous serializable callback handling.

Changes

File(s) Change Summary
src/context/browserSteps.tsx Exported BrowserStep type; added and implemented updateListStepLimit and updateListStepData in context interface/provider.
src/components/recorder/RightSidePanel.tsx Imported BrowserStep; used updateListStepLimit to update limit of latest list step during capture.
server/src/browser-management/classes/RemoteBrowser.ts Added private method extractListData for structured list data extraction; added socket event listener for 'extractListData'.
src/components/browser/BrowserWindow.tsx Added socket listener for 'listDataExtracted' to update list step data; changed fields state update to direct object spread; emitted 'extractListData' event before adding list step.
src/components/run/InterpretationLog.tsx Removed serializable callback handling; added active action ID tracking and updated capture data from browser steps; changed capture list table rendering.
server/src/workflow-management/classes/Interpreter.ts Added mode: 'editor' property to options in interpretRecordingInEditor method.
maxun-core/src/interpret.ts Extended InterpreterOptions with optional mode; added early exit for scrapeSchema and scrapeList actions when mode is 'editor'.

Sequence Diagram(s)

sequenceDiagram
    participant RightPanel as RightSidePanel
    participant Provider as BrowserStepsProvider
    participant State as browserSteps State

    RightPanel->>Provider: Call updateListStepLimit(listId, limit)
    Provider->>State: Map steps, update ListStep with matching listId
    State-->>Provider: Return updated steps array
    Provider-->>RightPanel: Context updated, UI reflects new limit
Loading
sequenceDiagram
    participant Client as Editor Client
    participant Server as RemoteBrowser
    participant Page as Playwright Page

    Client->>Server: emit 'extractListData' with listSelector, fields, currentListId
    Server->>Page: extractListData(listSelector, fields, limit)
    Page-->>Server: return extracted list data
    Server->>Client: emit 'listDataExtracted' with data and currentListId
Loading

Assessment against linked issues

Objective Addressed Explanation
Implement mechanism to update ListStep limit (#567)

Suggested reviewers

  • amhsirak

Poem

In the warren of code, a new path appears,
Update your ListStep limits, banish old fears!
Now exported steps hop far and wide,
Context refreshed, with state as our guide.
Data extracted with care from the page,
A rabbit’s delight, setting the stage! 🐇✨
"""


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 817c725 and 38430ec.

📒 Files selected for processing (1)
  • maxun-core/src/interpret.ts (3 hunks)
🔇 Additional comments (3)
maxun-core/src/interpret.ts (3)

40-40: Good addition of the optional 'mode' parameter to the interface.

This new property extends the InterpreterOptions interface with a mode setting that can be used to control interpreter behavior, which aligns well with the PR's goal of improving output preview performance.


437-440: Optimization for editor mode looks good.

This change bypasses the actual scraping logic when in editor mode, immediately returning an empty object instead. This should significantly improve performance in the editor by avoiding unnecessary data processing.


472-475: Similar optimization applied to scrapeList - good consistency.

Applying the same pattern to the scrapeList method ensures consistent behavior across different scraping actions. This early exit pattern will help deliver faster output previews in editor mode.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/context/browserSteps.tsx (1)

146-158: Fix indentation inconsistency and consider edge cases.

The implementation is functional but has inconsistent indentation (2 spaces vs 4 spaces used elsewhere in the file). Also, consider adding validation for the limit parameter.

 const updateListStepLimit = (listId: number, limit: number) => {
     setBrowserSteps(prevSteps =>
-          prevSteps.map(step => {
-            if (step.type === 'list' && step.id === listId) {
-              return {
-                ...step,
-                limit: limit
-              };
-            }
-            return step;
-          })
+        prevSteps.map(step => {
+            if (step.type === 'list' && step.id === listId) {
+                return {
+                    ...step,
+                    limit // object property shorthand
+                };
+            }
+            return step;
+        })
     );
 };
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e62c9a and 4ffce7b.

📒 Files selected for processing (1)
  • src/context/browserSteps.tsx (4 hunks)
🔇 Additional comments (3)
src/context/browserSteps.tsx (3)

29-29: Appropriate export of BrowserStep type.

Exporting this type is a good idea since it will make it accessible to other components that need to work with browser steps.


47-47: Function signature looks good.

The signature is consistent with other context methods and follows the naming convention of the codebase.


184-184: Good addition to context provider value.

The method is correctly added to the context provider value, making it accessible to consumers of the context.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
src/components/recorder/RightSidePanel.tsx (1)

412-412: 🛠️ Refactor suggestion

Update dependency array in useCallback.

The handleConfirmListCapture function now uses the getLatestListStep function and updateListStepLimit, but they aren't included in the dependency array of the useCallback. This could lead to stale closures if these dependencies change.

  }, [captureStage, paginationType, limitType, customLimit, startPaginationMode, setShowPaginationOptions, setCaptureStage, getListSettingsObject, notify, stopPaginationMode, startLimitMode, setShowLimitOptions, stopLimitMode, setIsCaptureListConfirmed, stopCaptureAndEmitGetListSettings, t]);
+  }, [captureStage, paginationType, limitType, customLimit, startPaginationMode, setShowPaginationOptions, setCaptureStage, getListSettingsObject, notify, stopPaginationMode, startLimitMode, setShowLimitOptions, stopLimitMode, setIsCaptureListConfirmed, stopCaptureAndEmitGetListSettings, browserSteps, updateListStepLimit, t]);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ffce7b and 6f047be.

📒 Files selected for processing (1)
  • src/components/recorder/RightSidePanel.tsx (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/components/recorder/RightSidePanel.tsx (1)
src/context/browserSteps.tsx (2)
  • useBrowserSteps (192-198)
  • BrowserStep (29-29)
🪛 Biome (1.9.4)
src/components/recorder/RightSidePanel.tsx

[error] 396-396: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Unsafe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)

🔇 Additional comments (4)
src/components/recorder/RightSidePanel.tsx (4)

10-10: Good addition of the BrowserStep type import.

The BrowserStep type import is necessary for typing the new getLatestListStep function parameter.


72-72: Good inclusion of updateListStepLimit in the hook destructuring.

Properly destructured the updateListStepLimit function from the useBrowserSteps hook to enable updating of list step limits.


352-357: Well-implemented helper function.

The getLatestListStep function is well-designed with:

  • Proper filtering for list-type steps
  • Null-check safety for empty results
  • Clear sorting by descending ID to get the most recent step

This helper keeps the code DRY by extracting the list step finding logic.


396-399: Good implementation of limit updating for the latest list step.

This code efficiently:

  1. Finds the most recent list step using the new helper function
  2. Updates its limit property with the parsed limit value
  3. Does this before completing the capture process

This enhancement improves the responsiveness of the output preview by immediately updating the list step limit.

🧰 Tools
🪛 Biome (1.9.4)

[error] 396-396: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Unsafe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)

Comment on lines +395 to +400

const latestListStep = getLatestListStep(browserSteps);
if (latestListStep) {
updateListStepLimit(latestListStep.id, limit);
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix the switch case declaration scope issue.

The static analysis tool correctly identifies a potential issue with the declaration scope in the switch statement. Variables declared in a switch case without a block are accessible in other cases, which could lead to unintended behavior.

Apply this fix:

      case 'limit':
        if (!limitType || (limitType === 'custom' && !customLimit)) {
          notify('error', t('right_panel.errors.select_limit'));
          return;
        }
        const limit = limitType === 'custom' ? parseInt(customLimit) : parseInt(limitType);
        if (isNaN(limit) || limit <= 0) {
          notify('error', t('right_panel.errors.invalid_limit'));
          return;
        }

+       {
          const latestListStep = getLatestListStep(browserSteps);
          if (latestListStep) {
            updateListStepLimit(latestListStep.id, limit);
          }
+       }

        stopLimitMode();
        setShowLimitOptions(false);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const latestListStep = getLatestListStep(browserSteps);
if (latestListStep) {
updateListStepLimit(latestListStep.id, limit);
}
{
const latestListStep = getLatestListStep(browserSteps);
if (latestListStep) {
updateListStepLimit(latestListStep.id, limit);
}
}
🧰 Tools
🪛 Biome (1.9.4)

[error] 396-396: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Unsafe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
server/src/browser-management/classes/RemoteBrowser.ts (3)

727-765: findSimilarElements performs a full-document scan – risk of O(N²) on large pages

The helper queries all elements of the same tag across:

• main document
• every shadow DOM
• every iframe / frame

and then runs a class-set similarity comparison on each one.
On long feeds (think social networks) this can easily touch tens of thousands of nodes and freeze the browser.

Possible optimisations:

  1. Bail out early once containers.length ≥ limit.
  2. Restrict the search scope (e.g. sibling or parent section).
  3. Cache classList → element arrays per tag to avoid repeated DOM walks.

Worth profiling with DevTools’ Performance tab before shipping.


543-547: Nit: remove Biome-flagged unnecessary continue

Static analysis (noUnnecessaryContinue) spotted a continue that immediately jumps to the next loop iteration although no code follows it.
Dropping it makes the intent clearer and removes dead control flow.

🧰 Tools
🪛 Biome (1.9.4)

[error] 545-545: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)


448-449: Magic number 5 for default limit

The hard-coded default limit (limit: number = 5) is sprinkled in multiple places.
Consider elevating it to a named constant (DEFAULT_LIST_LIMIT) to convey semantics and keep eventual tweaks centralised.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f047be and 1022ca6.

📒 Files selected for processing (1)
  • server/src/browser-management/classes/RemoteBrowser.ts (2 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
server/src/browser-management/classes/RemoteBrowser.ts

[error] 545-545: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

Comment on lines +430 to +960
/**
* Extract data from a list of elements on a page
* @param page - Playwright Page object
* @param listSelector - CSS selector for the list container
* @param fields - Record of field configurations
* @param limit - Maximum number of items to extract (default: 5)
* @returns Promise<Array<Record<string, string>>> - Array of extracted data objects
*/
private async extractListData(
page: Page,
listSelector: string,
fields: Record<string, {
label: string;
selectorObj: {
selector: string;
attribute: string;
};
}>,
limit: number = 5
): Promise<Array<Record<string, string>>> {
return await page.evaluate(
async ({ listSelector, fields, limit }: {
listSelector: string;
fields: Record<string, {
label: string;
selectorObj: {
selector: string;
attribute: string;
};
}>;
limit: number;
}) => {
const convertedFields: Record<string, {
selector: string;
attribute: string;
}> = {};

for (const [key, field] of Object.entries(fields)) {
convertedFields[field.label] = {
selector: field.selectorObj.selector,
attribute: field.selectorObj.attribute
};
}

const queryElement = (rootElement: Element | Document, selector: string): Element | null => {
if (!selector.includes('>>') && !selector.includes(':>>')) {
return rootElement.querySelector(selector);
}

const parts = selector.split(/(?:>>|:>>)/).map(part => part.trim());
let currentElement: Element | Document | null = rootElement;

for (let i = 0; i < parts.length; i++) {
if (!currentElement) return null;

if ((currentElement as Element).tagName === 'IFRAME' || (currentElement as Element).tagName === 'FRAME') {
try {
const frameElement = currentElement as HTMLIFrameElement | HTMLFrameElement;
const frameDoc = frameElement.contentDocument || frameElement.contentWindow?.document;
if (!frameDoc) return null;
currentElement = frameDoc.querySelector(parts[i]);
continue;
} catch (e) {
console.warn(`Cannot access ${(currentElement as Element).tagName.toLowerCase()} content:`, e);
return null;
}
}

let nextElement: Element | null = null;

if ('querySelector' in currentElement) {
nextElement = currentElement.querySelector(parts[i]);
}

if (!nextElement && 'shadowRoot' in currentElement && (currentElement as Element).shadowRoot) {
nextElement = (currentElement as Element).shadowRoot!.querySelector(parts[i]);
}

if (!nextElement && 'children' in currentElement) {
const children: any = Array.from((currentElement as Element).children || []);
for (const child of children) {
if (child.shadowRoot) {
nextElement = child.shadowRoot.querySelector(parts[i]);
if (nextElement) break;
}
}
}

currentElement = nextElement;
}

return currentElement as Element | null;
};

const queryElementAll = (rootElement: Element | Document, selector: string): Element[] => {
if (!selector.includes('>>') && !selector.includes(':>>')) {
return Array.from(rootElement.querySelectorAll(selector));
}

const parts = selector.split(/(?:>>|:>>)/).map(part => part.trim());
let currentElements: (Element | Document)[] = [rootElement];

for (const part of parts) {
const nextElements: Element[] = [];

for (const element of currentElements) {
if ((element as Element).tagName === 'IFRAME' || (element as Element).tagName === 'FRAME') {
try {
const frameElement = element as HTMLIFrameElement | HTMLFrameElement;
const frameDoc = frameElement.contentDocument || frameElement.contentWindow?.document;
if (frameDoc) {
nextElements.push(...Array.from(frameDoc.querySelectorAll(part)));
}
} catch (e) {
console.warn(`Cannot access ${(element as Element).tagName.toLowerCase()} content:`, e);
continue;
}
} else {
if ('querySelectorAll' in element) {
nextElements.push(...Array.from(element.querySelectorAll(part)));
}

if ('shadowRoot' in element && (element as Element).shadowRoot) {
nextElements.push(...Array.from((element as Element).shadowRoot!.querySelectorAll(part)));
}

if ('children' in element) {
const children = Array.from((element as Element).children || []);
for (const child of children) {
if (child.shadowRoot) {
nextElements.push(...Array.from(child.shadowRoot.querySelectorAll(part)));
}
}
}
}
}

currentElements = nextElements;
}

return currentElements as Element[];
};

function extractValue(element: Element, attribute: string): string | null {
if (!element) return null;

const baseURL = element.ownerDocument?.location?.href || window.location.origin;

if (element.shadowRoot) {
const shadowContent = element.shadowRoot.textContent;
if (shadowContent?.trim()) {
return shadowContent.trim();
}
}

if (attribute === 'innerText') {
return (element as HTMLElement).innerText.trim();
} else if (attribute === 'innerHTML') {
return element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
if (attribute === 'href' && element.tagName !== 'A') {
const parentElement = element.parentElement;
if (parentElement && parentElement.tagName === 'A') {
const parentHref = parentElement.getAttribute('href');
if (parentHref) {
try {
return new URL(parentHref, baseURL).href;
} catch (e) {
return parentHref;
}
}
}
}

const attrValue = element.getAttribute(attribute);
const dataAttr = attrValue || element.getAttribute('data-' + attribute);

if (!dataAttr || dataAttr.trim() === '') {
if (attribute === 'src') {
const style = window.getComputedStyle(element);
const bgImage = style.backgroundImage;
if (bgImage && bgImage !== 'none') {
const matches = bgImage.match(/url\(['"]?([^'")]+)['"]?\)/);
return matches ? new URL(matches[1], baseURL).href : null;
}
}
return null;
}

try {
return new URL(dataAttr, baseURL).href;
} catch (e) {
console.warn('Error creating URL from', dataAttr, e);
return dataAttr; // Return the original value if URL construction fails
}
}
return element.getAttribute(attribute);
}

function findTableAncestor(element: Element): { type: string; element: Element } | null {
let currentElement: Element | null = element;
const MAX_DEPTH = 5;
let depth = 0;

while (currentElement && depth < MAX_DEPTH) {
if (currentElement.getRootNode() instanceof ShadowRoot) {
currentElement = (currentElement.getRootNode() as ShadowRoot).host;
continue;
}

if (currentElement.tagName === 'TD') {
return { type: 'TD', element: currentElement };
} else if (currentElement.tagName === 'TR') {
return { type: 'TR', element: currentElement };
}

if (currentElement.tagName === 'IFRAME' || currentElement.tagName === 'FRAME') {
try {
const frameElement = currentElement as HTMLIFrameElement | HTMLFrameElement;
currentElement = frameElement.contentDocument?.body || null;
} catch (e) {
return null;
}
} else {
currentElement = currentElement.parentElement;
}
depth++;
}
return null;
}

function getCellIndex(td: Element): number {
if (td.getRootNode() instanceof ShadowRoot) {
const shadowRoot = td.getRootNode() as ShadowRoot;
const allCells = Array.from(shadowRoot.querySelectorAll('td'));
return allCells.indexOf(td as HTMLTableCellElement);
}

let index = 0;
let sibling = td;
while (sibling = sibling.previousElementSibling as Element) {
index++;
}
return index;
}

function hasThElement(row: Element, tableFields: Record<string, { selector: string; attribute: string }>): boolean {
for (const [_, { selector }] of Object.entries(tableFields)) {
const element = queryElement(row, selector);
if (element) {
let current: Element | ShadowRoot | Document | null = element;
while (current && current !== row) {
if (current.getRootNode() instanceof ShadowRoot) {
current = (current.getRootNode() as ShadowRoot).host;
continue;
}

if ((current as Element).tagName === 'TH') return true;

if ((current as Element).tagName === 'IFRAME' || (current as Element).tagName === 'FRAME') {
try {
const frameElement = current as HTMLIFrameElement | HTMLFrameElement;
current = frameElement.contentDocument?.body || null;
} catch (e) {
break;
}
} else {
current = (current as Element).parentElement;
}
}
}
}
return false;
}

function filterRowsBasedOnTag(rows: Element[], tableFields: Record<string, { selector: string; attribute: string }>): Element[] {
for (const row of rows) {
if (hasThElement(row, tableFields)) {
return rows;
}
}
return rows.filter(row => {
const directTH = row.getElementsByTagName('TH').length === 0;
const shadowTH = row.shadowRoot ?
row.shadowRoot.querySelector('th') === null : true;
return directTH && shadowTH;
});
}

function calculateClassSimilarity(classList1: string[], classList2: string[]): number {
const set1 = new Set(classList1);
const set2 = new Set(classList2);
const intersection = new Set([...set1].filter(x => set2.has(x)));
const union = new Set([...set1, ...set2]);
return intersection.size / union.size;
}

function findSimilarElements(baseElement: Element, similarityThreshold: number = 0.7): Element[] {
const baseClasses = Array.from(baseElement.classList);
if (baseClasses.length === 0) return [];

const allElements: Element[] = [];

allElements.push(...Array.from(document.getElementsByTagName(baseElement.tagName)));

if (baseElement.getRootNode() instanceof ShadowRoot) {
const shadowHost = (baseElement.getRootNode() as ShadowRoot).host;
allElements.push(...Array.from(shadowHost.getElementsByTagName(baseElement.tagName)));
}

const frames = [
...Array.from(document.getElementsByTagName('iframe')),
...Array.from(document.getElementsByTagName('frame'))
];

for (const frame of frames) {
try {
const frameElement = frame as HTMLIFrameElement | HTMLFrameElement;
const frameDoc = frameElement.contentDocument || frameElement.contentWindow?.document;
if (frameDoc) {
allElements.push(...Array.from(frameDoc.getElementsByTagName(baseElement.tagName)));
}
} catch (e) {
console.warn(`Cannot access ${frame.tagName.toLowerCase()} content:`, e);
}
}

return allElements.filter(element => {
if (element === baseElement) return false;
const similarity = calculateClassSimilarity(
baseClasses,
Array.from(element.classList)
);
return similarity >= similarityThreshold;
});
}

let containers = queryElementAll(document, listSelector);

if (containers.length === 0) return [];

if (limit > 1 && containers.length === 1) {
const baseContainer = containers[0];
const similarContainers = findSimilarElements(baseContainer);

if (similarContainers.length > 0) {
const newContainers = similarContainers.filter(container =>
!container.matches(listSelector)
);
containers = [...containers, ...newContainers];
}
}

const containerFields = containers.map(() => ({
tableFields: {} as Record<string, {
selector: string;
attribute: string;
tableContext?: string;
cellIndex?: number;
}>,
nonTableFields: {} as Record<string, {
selector: string;
attribute: string;
}>
}));

containers.forEach((container, containerIndex) => {
for (const [label, field] of Object.entries(convertedFields)) {
const sampleElement = queryElement(container, field.selector);

if (sampleElement) {
const ancestor = findTableAncestor(sampleElement);
if (ancestor) {
containerFields[containerIndex].tableFields[label] = {
...field,
tableContext: ancestor.type,
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
};
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
}
});

const tableData: Array<Record<string, string>> = [];
const nonTableData: Array<Record<string, string>> = [];

for (let containerIndex = 0; containerIndex < containers.length; containerIndex++) {
const container = containers[containerIndex];
const { tableFields } = containerFields[containerIndex];

if (Object.keys(tableFields).length > 0) {
const firstField = Object.values(tableFields)[0];
const firstElement = queryElement(container, firstField.selector);
let tableContext: Element | null = firstElement;

while (tableContext && tableContext.tagName !== 'TABLE' && tableContext !== container) {
if (tableContext.getRootNode() instanceof ShadowRoot) {
tableContext = (tableContext.getRootNode() as ShadowRoot).host;
continue;
}

if (tableContext.tagName === 'IFRAME' || tableContext.tagName === 'FRAME') {
try {
const frameElement = tableContext as HTMLIFrameElement | HTMLFrameElement;
tableContext = frameElement.contentDocument?.body || null;
} catch (e) {
break;
}
} else {
tableContext = tableContext.parentElement;
}
}

if (tableContext) {
const rows: Element[] = [];

rows.push(...Array.from(tableContext.getElementsByTagName('TR')));

if (tableContext.tagName === 'IFRAME' || tableContext.tagName === 'FRAME') {
try {
const frameElement = tableContext as HTMLIFrameElement | HTMLFrameElement;
const frameDoc = frameElement.contentDocument || frameElement.contentWindow?.document;
if (frameDoc) {
rows.push(...Array.from(frameDoc.getElementsByTagName('TR')));
}
} catch (e) {
console.warn(`Cannot access ${tableContext.tagName.toLowerCase()} rows:`, e);
}
}

const processedRows = filterRowsBasedOnTag(rows, tableFields);

for (let rowIndex = 0; rowIndex < Math.min(processedRows.length, limit); rowIndex++) {
const record: Record<string, string> = {};
const currentRow = processedRows[rowIndex];

for (const [label, { selector, attribute, cellIndex }] of Object.entries(tableFields)) {
let element: Element | null = null;

if (cellIndex !== undefined && cellIndex >= 0) {
let td: Element | null = currentRow.children[cellIndex] || null;

if (!td && currentRow.shadowRoot) {
const shadowCells = currentRow.shadowRoot.children;
if (shadowCells && shadowCells.length > cellIndex) {
td = shadowCells[cellIndex];
}
}

if (td) {
element = queryElement(td, selector);

if (!element && selector.split(/(?:>>|:>>)/).pop()?.includes('td:nth-child')) {
element = td;
}

if (!element) {
const tagOnlySelector = selector.split('.')[0];
element = queryElement(td, tagOnlySelector);
}

if (!element) {
let currentElement: Element | null = td;
while (currentElement && currentElement.children.length > 0) {
let foundContentChild = false;
for (const child of Array.from(currentElement.children)) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
}
element = currentElement;
}
}
} else {
element = queryElement(currentRow, selector);
}

if (element) {
const value = extractValue(element, attribute);
if (value !== null) {
record[label] = value;
}
}
}

if (Object.keys(record).length > 0) {
tableData.push(record);
}
}
}
}
}

for (let containerIndex = 0; containerIndex < containers.length; containerIndex++) {
if (nonTableData.length >= limit) break;

const container = containers[containerIndex];
const { nonTableFields } = containerFields[containerIndex];

if (Object.keys(nonTableFields).length > 0) {
const record: Record<string, string> = {};

for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
const relativeSelector = selector.split(/(?:>>|:>>)/).slice(-1)[0];
const element = queryElement(container, relativeSelector);

if (element) {
const value = extractValue(element, attribute);
if (value !== null) {
record[label] = value;
}
}
}

if (Object.keys(record).length > 0) {
nonTableData.push(record);
}
}
}

const scrapedData = [...tableData, ...nonTableData].slice(0, limit);
return scrapedData;
},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

extractListData is a 500-line page.evaluate — unmaintainable & hard to test

Embedding such a large, multi-purpose scraping engine inside page.evaluate has several downsides:

  1. Readability/maintainability: one method now accounts for ~45 % of the file.
  2. No type safety or linting inside the browser context; runtime errors surface only in production.
  3. Hot-reloading & coverage tools can’t exercise the in-page code.
  4. Bundle size & perf: the whole string is sent over CDP on every call.

Recommended refactor (can be incremental):

• Move the browser-side code to a dedicated script (e.g. extractListData.browser.ts) and load it once with page.addInitScript / evaluateOnNewDocument.
• Keep a thin wrapper here that forwards parameters and returns the result.
• Write unit / integration tests for the extraction logic in isolation.

This will dramatically improve readability, enable proper TS checks, and reduce wire traffic.

🧰 Tools
🪛 Biome (1.9.4)

[error] 545-545: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

Comment on lines +1065 to +1083
this.socket.on('extractListData', async (data: {
listSelector: string,
fields: Record<string, any>,
currentListId: number,
pagination: any
}) => {
if (this.currentPage) {
const extractedData = await this.extractListData(
this.currentPage,
data.listSelector,
data.fields
);

this.socket.emit('listDataExtracted', {
currentListId: data.currentListId,
data: extractedData
});
}
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing limit propagation breaks the UI-requested page size

The client is already sending a pagination object, yet the listener discards it and always calls extractListData with the default limit of 5.
Users who change the limit in the UI will therefore see silently truncated results, defeating the very “faster output preview” goal of this PR.
Forward the requested limit (and consider making pagination optional in the method’s signature).

-                const extractedData = await this.extractListData(
-                    this.currentPage, 
-                    data.listSelector, 
-                    data.fields
-                );
+                const extractedData = await this.extractListData(
+                    this.currentPage,
+                    data.listSelector,
+                    data.fields,
+                    // Default to 5 only when the caller provides nothing
+                    data.pagination?.limit ?? 5
+                );

Committable suggestion skipped: line range outside the PR's diff.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/components/browser/BrowserWindow.tsx (1)

255-267: Consider adding error handling for socket operations.

The implementation of socket event listeners and emitters lacks error handling. Consider adding error handling to gracefully manage connection issues or failed operations.

// For the socket event listener
socket.on('listDataExtracted', (response) => {
+  try {
     const { currentListId, data } = response;
     updateListStepData(currentListId, data);
+  } catch (error) {
+    console.error('Error processing list data:', error);
+    // Optionally notify the user
+    notify('error', 'Failed to process list data');
+  }
});

// For socket emissions
+try {
  socket?.emit('extractListData', {
    listSelector,
    fields: updatedFields,
    currentListId,
    pagination: { type: '', selector: paginationSelector }
  });
+} catch (error) {
+  console.error('Error sending list data for extraction:', error);
+  // Optionally notify the user
+  notify('error', 'Failed to send data for extraction');
+}

Also applies to: 404-410, 476-482

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1022ca6 and f43b860.

📒 Files selected for processing (1)
  • src/components/browser/BrowserWindow.tsx (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/components/browser/BrowserWindow.tsx (1)
src/context/browserSteps.tsx (1)
  • useBrowserSteps (192-198)
🔇 Additional comments (4)
src/components/browser/BrowserWindow.tsx (4)

90-90: LGTM: Hook updated to include the new updateListStepData method.

The addition of updateListStepData to the destructured methods from useBrowserSteps is appropriate for the new functionality being implemented.


255-267: Socket event listener for real-time list data updates looks good.

This new useEffect properly handles the socket event lifecycle by adding a listener for 'listDataExtracted' and cleaning it up on unmount. This implementation enables real-time updates of list data from the server.


404-410: New socket emission for list data extraction looks good.

This addition enables the server to extract list data based on the current state, which is part of the "faster output preview" feature. The data is sent before adding the list step, which ensures synchronization between client and server.


476-482: New socket emission for list data extraction looks good.

This addition for the attribute selection case correctly emits the 'extractListData' event with all necessary data to update the server-side state.

Comment on lines +468 to +473
const updatedFields = {
...fields,
[newField.id]: newField
};

setFields(updatedFields);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider using functional state updates to prevent race conditions.

Similar to the earlier instance, the state update has been changed from a functional update pattern to directly spreading the current state, which could lead to race conditions.

-const updatedFields = {
-  ...fields,
-  [newField.id]: newField
-};
-
-setFields(updatedFields);
+setFields(prevFields => ({
+  ...prevFields,
+  [newField.id]: newField
+}));

// Then use the callback pattern when passing to socket.emit, similar to the earlier suggestion

Comment on lines +396 to +401
const updatedFields = {
...fields,
[newField.id]: newField
};

setFields(updatedFields);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider using functional state updates to prevent race conditions.

The state update has been changed from a functional update pattern to directly spreading the current state. This approach could lead to race conditions if multiple state updates occur concurrently.

-const updatedFields = {
-  ...fields,
-  [newField.id]: newField
-};
-
-setFields(updatedFields);
+setFields(prevFields => ({
+  ...prevFields,
+  [newField.id]: newField
+}));

// Then use the callback pattern when passing to socket.emit
+const getUpdatedFields = () => ({
+  ...fields,
+  [newField.id]: newField
+});

// Later in the code:
+socket?.emit('extractListData', {
+  listSelector,
+  fields: getUpdatedFields(),
+  currentListId,
+  pagination: { type: '', selector: paginationSelector }
+});

Committable suggestion skipped: line range outside the PR's diff.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/components/run/InterpretationLog.tsx (1)

341-382: Potential mismatch between header labels and row keys

row[field.label] assumes that every data row uses the human-readable label as its object key.
If the backend sends canonical keys (e.g., field.name / field.id) this will render empty cells.

Consider using the field’s canonical identifier when indexing:

- {row[field.label]}
+ {row[field.key ?? field.name ?? field.label]}

…and ensure fields includes that identifier.

Also, Object.values(captureListData[captureListPage]?.fields || {}) does not guarantee column order.
If the backend provides an ordered array, prefer that over Object.values.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec36df7 and 0a1704e.

📒 Files selected for processing (1)
  • src/components/run/InterpretationLog.tsx (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/components/run/InterpretationLog.tsx (1)
src/context/browserSteps.tsx (1)
  • useBrowserSteps (208-214)

Comment on lines +101 to 112
const updateActiveTab = useCallback(() => {
const availableTabs = getAvailableTabs();

if (captureListData.length > 0 && availableTabs.findIndex(tab => tab.id === 'captureList') !== -1) {
setActiveTab(availableTabs.findIndex(tab => tab.id === 'captureList'));
} else if (captureTextData.length > 0 && availableTabs.findIndex(tab => tab.id === 'captureText') !== -1) {
setActiveTab(availableTabs.findIndex(tab => tab.id === 'captureText'));
} else if (screenshotData.length > 0 && availableTabs.findIndex(tab => tab.id === 'captureScreenshot') !== -1) {
setActiveTab(availableTabs.findIndex(tab => tab.id === 'captureScreenshot'));
}
}, [captureListData.length, captureTextData.length, screenshotData.length]);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

updateActiveTab never re-evaluates when its inputs change

Because updateActiveTab is memoised with useCallback, it will not run automatically once captureListData / captureTextData / screenshotData are updated; it is only executed manually from the earlier effect (which itself suffers from timing issues).

A safer pattern is:

// Remove the useCallback wrapper altogether
const updateActiveTab = () => { ... };

// Recalculate whenever any of the data arrays change.
useEffect(updateActiveTab, [
  captureListData.length,
  captureTextData.length,
  screenshotData.length,
]);

This guarantees the visible tab always reflects the latest available datasets.

Comment on lines +79 to +99
useEffect(() => {
if (activeActionId !== null) {
const textSteps = browserSteps.filter(step => step.type === 'text');
if (textSteps.length > 0) {
const textDataRow: Record<string, string> = {};
textSteps.forEach(step => {
textDataRow[step.label] = step.data;
});

setCaptureTextData([textDataRow]);
}
} else if (type === 'captureText') {
if (Array.isArray(data)) {
setCaptureTextData(data);
} else {
setCaptureTextData([data]);
}
if (captureTextData.length === 0) {
const availableTabs = getAvailableTabs();
const tabIndex = availableTabs.findIndex(tab => tab.id === 'captureText');
if (tabIndex !== -1) setActiveTab(tabIndex);

const listSteps = browserSteps.filter(step => step.type === 'list');
if (listSteps.length > 0) {
setCaptureListData(listSteps);
}

updateActiveTab();
}

scrollLogToBottom();
}, [captureListData.length, captureTextData.length, t]);
}, [activeActionId, browserSteps, t]);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Filter browserSteps by the active action ID and defer tab–selection until after state updates

activeActionId is only used to trigger this useEffect, yet it’s never employed in the filtering logic.
Consequently, every list/text step across all actions is shown whenever any action becomes active.

In addition, updateActiveTab() is invoked before setCaptureListData / setCaptureTextData complete, so the tab can be computed from stale (still-empty) arrays.

-const textSteps = browserSteps.filter(step => step.type === 'text');
+const textSteps = browserSteps.filter(
+  step => step.type === 'text' && step.actionId === activeActionId
+);

 ...
-const listSteps = browserSteps.filter(step => step.type === 'list');
+const listSteps = browserSteps.filter(
+  step => step.type === 'list' && step.actionId === activeActionId
+);

-  updateActiveTab();
+  // Let the state setters finish first; a separate effect will switch tabs.

Follow-up: add a second useEffect that watches [captureListData, captureTextData, screenshotData] and calls updateActiveTab() there.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
useEffect(() => {
if (activeActionId !== null) {
const textSteps = browserSteps.filter(step => step.type === 'text');
if (textSteps.length > 0) {
const textDataRow: Record<string, string> = {};
textSteps.forEach(step => {
textDataRow[step.label] = step.data;
});
setCaptureTextData([textDataRow]);
}
} else if (type === 'captureText') {
if (Array.isArray(data)) {
setCaptureTextData(data);
} else {
setCaptureTextData([data]);
}
if (captureTextData.length === 0) {
const availableTabs = getAvailableTabs();
const tabIndex = availableTabs.findIndex(tab => tab.id === 'captureText');
if (tabIndex !== -1) setActiveTab(tabIndex);
const listSteps = browserSteps.filter(step => step.type === 'list');
if (listSteps.length > 0) {
setCaptureListData(listSteps);
}
updateActiveTab();
}
scrollLogToBottom();
}, [captureListData.length, captureTextData.length, t]);
}, [activeActionId, browserSteps, t]);
useEffect(() => {
if (activeActionId !== null) {
const textSteps = browserSteps.filter(
step => step.type === 'text' && step.actionId === activeActionId
);
if (textSteps.length > 0) {
const textDataRow: Record<string, string> = {};
textSteps.forEach(step => {
textDataRow[step.label] = step.data;
});
setCaptureTextData([textDataRow]);
}
const listSteps = browserSteps.filter(
step => step.type === 'list' && step.actionId === activeActionId
);
if (listSteps.length > 0) {
setCaptureListData(listSteps);
}
// Let the state setters finish first; a separate effect will switch tabs.
}
}, [activeActionId, browserSteps, t]);

@amhsirak amhsirak removed the Status: Work In Progess This issue/PR is actively being worked on label May 7, 2025
@amhsirak amhsirak merged commit 8c1e9f3 into develop May 8, 2025
1 check passed
@amhsirak amhsirak deleted the faster-preview branch September 18, 2025 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Type: Enhancement Improvements to existing features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feat: Faster output preview data

3 participants