feat: support `iframe` extraction #275

amhsirak · 2024-12-20T13:02:55Z

Summary by CodeRabbit

New Features
- Enhanced iframe scraping capabilities across multiple components.
- Improved selector handling for complex web page structures.
- Added support for traversing and extracting data from nested iframes.
- Introduced new filtering for selectors to refine workflow execution.
Bug Fixes
- Resolved issues with URL resolution and element retrieval in iframe contexts.
- Improved cross-origin access error handling during web scraping.
Refactor
- Significant updates to selector and scraping logic.
- Enhanced type definitions for more robust iframe selector management.

coderabbitai · 2024-12-20T13:03:05Z

Walkthrough

This pull request introduces comprehensive enhancements for handling iframes across multiple components of the application. The changes span browser-side scraping, server-side selector management, and type definitions, focusing on improving the ability to traverse and extract data from complex DOM structures containing iframes and shadow DOMs. The modifications introduce new methods for iframe selector handling, element retrieval, and cross-origin access management, creating a more robust mechanism for interacting with nested iframe content.

Changes

File	Change Summary
`maxun-core/src/browserSide/scraper.js`	Enhanced `findAllElements` for iframe and shadow DOM support, modified `getElementValue` for relative URL resolution, refactored `scrapeSchema` and `scrapeList` for better data gathering.
`maxun-core/src/interpret.ts`	Added `removeSpecialSelectors` method to filter out selectors with `:>>` or `>>`, updated `runLoop` to utilize this new method.
`server/src/types/index.ts`	Introduced `IframeSelector` interface, updated `Selectors` to include `iframeSelector` and allow null for `shadowSelector`.
`server/src/workflow-management/selector.ts`	Added `getDeepestElementFromPoint` for improved element retrieval in nested iframes, expanded element information with iframe context details.
`server/src/workflow-management/utils.ts`	Updated `getBestSelectorForAction` to prioritize `iframeSelector`.
`src/components/organisms/BrowserWindow.tsx`	Added `isIframeContent` property to `ElementInfo`, modified `highlighterHandler` to manage iframe content.

Sequence Diagram

sequenceDiagram
    participant Browser
    participant Scraper
    participant Interpreter
    participant Selector

    Browser->>Scraper: Initiate scraping
    Scraper->>Scraper: Find elements with iframe selectors
    Scraper->>Interpreter: Process workflow
    Interpreter->>Interpreter: Remove special selectors
    Interpreter->>Selector: Retrieve element information
    Selector->>Selector: Traverse nested iframes
    Selector-->>Interpreter: Return element details
    Interpreter-->>Scraper: Provide processed data
    Scraper-->>Browser: Return scraped content

Possibly related PRs

feat: shadow dom selection #296: The changes in this PR enhance the scraper.js file to improve scraping capabilities for elements within shadow DOMs, which directly relates to the modifications made in the main PR that also focus on handling shadow DOM contexts.

Suggested Labels

Status: In Review

Poem

🐰 Hopping through iframes with glee,
Selectors dancing, wild and free!
Nested content, no longer a maze,
Our scraper leaps through browsing ways.
Code rabbit's magic makes data sing! 🌐

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

amhsirak

@RohitR311 resolve merge conflicts

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (17)

src/components/organisms/BrowserWindow.tsx (4)

16-16: Add JSDoc comment for clarity.
Consider adding a short JSDoc comment explaining the purpose of the isIframeContent property. This helps maintainers and IDEs quickly understand why an element might be flagged as iframe content.

121-123: Replace console.log with a proper logger or remove debug statements.
Leaving plain console logs in production code can clutter the console and potentially leak internal details.

132-132: Clarify the code comment.
Ensure that the comment precisely reflects the condition. It might be helpful to mention the effect or purpose of skipping the types 'none', 'scrollDown', and 'scrollUp'.

141-161: Validate iframe logic for correctness.
This block introduces sophisticated checks for iframe selectors and nested DOM structures. It appears logically sound for capturing iframe child elements. However, consider extracting repeated checks (e.g., data.selector.includes(':>>')) into a dedicated utility function for readability and maintainability.
maxun-core/src/browserSide/scraper.js (6)
191-196: Refactor suggestion for early returns.
In window.scrapeSchema, the code seems readable, but consider placing utility logic in separate helper functions to keep the main function concise.

Also applies to: 203-203

248-248: Unnecessary continue statement.
As flagged by static analysis, the continue; statement might not be needed. Removing it will not change the logic because there's no subsequent code in this loop iteration.
-            continue;
🧰 Tools

🪛 Biome (1.9.4)

[error] 248-248: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

284-355: Efficient fallback approach.
This fallback logic for scrapeSchema is a useful safeguard when MBE-based scraping misses elements. The code is large; a future refactoring might improve readability.

367-427: Good approach for iframe querying in scrapeList.
The new queryIframe and queryIframeAll functions are well-structured. This enables robust element retrieval within nested iframes. Consider adding more unit tests to ensure edge cases (deeply nested iframes, cross-origin frames) are handled.

🧰 Tools

🪛 Biome (1.9.4)

[error] 416-416: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

Line range hint 482-515: Check for performance overhead in deeper iframes.
Looping through each child and verifying table or TH elements might become costly if the list is large. If performance is a concern, caching or short-circuiting might help.

416-416: Unnecessary continue statement.
As per static analysis, removing this statement can simplify the loop.
-        continue;
🧰 Tools

🪛 Biome (1.9.4)

[error] 416-416: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)
server/src/workflow-management/selector.ts (5)
484-528: Similar duplication.
The repeated getDeepestElementFromPoint logic might be consolidated. Repeated code can drift out of sync.

1796-1797: Streamline code.
Static analysis flagged this as an unnecessary continue;. Removing it cleans up the loop.
-            continue;
🧰 Tools

🪛 Biome (1.9.4)

[error] 1796-1797: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

1828-1828: Remove superfluous continue.
Similar to the previous case, removing this continues to yield the same flow.
-        continue;
🧰 Tools

🪛 Biome (1.9.4)

[error] 1828-1828: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

1845-1846: Another unnecessary continue.
Following the same pattern, consider removing for clarity.
-            continue;
🧰 Tools

🪛 Biome (1.9.4)

[error] 1845-1846: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

1874-1874: Unneeded continue statement.
Same reasoning; removing it doesn’t affect the loop logic.
-                continue;
🧰 Tools

🪛 Biome (1.9.4)

[error] 1874-1874: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)
maxun-core/src/interpret.ts (2)

669-681: Ensure robust filtering and coverage for iframe-like notations.
Currently, the removeIframeSelectors method filters out selectors containing ":>>". This achieves the immediate goal of excluding iframe selectors, but consider other potential iframe-like notations or future expansions (e.g., stricter substring checks, case-sensitivity, or partial usage like :>) if the codebase or requirements might evolve. Adding unit tests to verify all expected and edge-case notations can provide additional confidence.

684-686: Consider using a more modern or type-safe cloning approach.
Using JSON.parse(JSON.stringify(workflow)) may lead to the loss of any non-serializable data (e.g., dates, custom classes). If your workflow includes such fields in the future, you could consider using structuredClone (in environments that support it) or another library-based deep-cloning method to preserve object fidelity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 170e4f1 and 5e775fa.

📒 Files selected for processing (6)

maxun-core/src/browserSide/scraper.js (7 hunks)
maxun-core/src/interpret.ts (1 hunks)
server/src/types/index.ts (3 hunks)
server/src/workflow-management/selector.ts (15 hunks)
server/src/workflow-management/utils.ts (1 hunks)
src/components/organisms/BrowserWindow.tsx (2 hunks)

🧰 Additional context used

🪛 Biome (1.9.4)

src/components/organisms/BrowserWindow.tsx

[error] 138-138: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

maxun-core/src/browserSide/scraper.js

[error] 248-248: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 416-416: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

server/src/workflow-management/selector.ts

[error] 1796-1797: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 1828-1828: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 1845-1846: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 1874-1874: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

🔇 Additional comments (19)

src/components/organisms/BrowserWindow.tsx (5)

127-128: Approve additional child selectors validation.
This check for valid child selectors looks appropriate and robust.

139-139: Comment clarity approved.
This line helps future readers understand the highlight logic.

163-163: Clear, concise comment.
This comment clarifies that we reset the highlighter if no valid child is found. Looks good.

167-167: Essential comment for list selector initialization.
Retaining this comment is helpful for maintainers who need to revisit how the listSelector mechanism is initialized.

171-171: Confirm non-list fallback comment.
This line clarifies how the fallback is handled outside the list-based logic.

maxun-core/src/browserSide/scraper.js (4)

207-257: Check for cross-origin access and potential fallback handling.
The findAllElements function robustly accounts for iframes, which is great. Ensure that any cross-origin frames produce graceful fallbacks, so partial results are still valid.

🧰 Tools

🪛 Biome (1.9.4)

[error] 248-248: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

258-283: Robust URL resolution logic.
The getElementValue function does a good job of handling relative URLs in iframes. Verify that it covers all relevant attributes beyond href and src—particularly for custom data attributes if needed.

Line range hint 444-466: Fine-grained logic for table imports.
This table-ancestor logic is very detailed. Consider simpler approaches if possible, but overall it's correct for specialized table structures.

557-572: Similar container detection.
Avoid unintentionally scraping unrelated containers if the layout changes. Possibly add heuristics or checks to confirm container similarity beyond class-based heuristics.

server/src/workflow-management/selector.ts (5)

Line range hint 80-101: Accurate property definitions for element info.
Storing iframeIndex and frameHierarchy is a helpful approach for debugging and re-targeting.

172-220: Check for nested table or body logic.
The large block checks if an element is a table or crosses the body. Confirm that the final break condition is always correct for complex iframes or edge-case elements.

374-420: Coordinates adjustment for multi-frame elements.
This code is well-structured to re-map element positions from an iframe to the top window. Test thoroughly if iFrames are offset or scrolled.

588-599: Neat rectangular toJSON logic.
Exposing bounding rects with a toJSON() method is a thoughtful approach for debugging or server calls.

Line range hint 1371-1473: Non-unique selectors for iframes.
The approach of generating partial selectors (like :nth-child()) can be fragile in dynamic DOMs. Verify that ephemeral changes (e.g., ads injecting extra elements) won’t break these partial selectors.

server/src/workflow-management/utils.ts (1)

15-19: Good fallback for iframe selectors.
Prioritizing iframeSelector.full is consistent with the new nested DOM handling. Make sure the rest of the system respects this if the user’s environment restricts iframes.

server/src/types/index.ts (3)

132-135: Well-defined iframe selector interface.
Defining IframeSelector clarifies how iframe-based selectors differ from standard selectors, improving maintainability and type safety.

151-151: Great extension for iframes.
Adding the iframeSelector property to Selectors ensures thorough coverage for nested elements. This is logically consistent with the rest of the PR.

165-165: Strengthen type usage.
Using selectors: Selectors instead of a generic object is a best practice, ensuring consistency and discoverability of properties like iframeSelector.

maxun-core/src/interpret.ts (1)

669-681: Validate that removing iframe selectors meets workflow expectations.
Stripping selectors mid-workflow can sometimes create unexpected behavior if subsequent steps rely on them. It might be beneficial to log or otherwise track which selectors were removed before returning the modified workflow. This would aid debugging in scenarios where the removal of selectors inadvertently breaks a downstream action.

coderabbitai · 2025-01-04T16:42:52Z

server/src/workflow-management/selector.ts

+          // Helper function to find elements within iframes, handling nested cases
+          const getDeepestElementFromPoint = (x: number, y: number): HTMLElement | null => {
+            // First, get the element at the clicked coordinates in the main document
+            let element = document.elementFromPoint(x, y) as HTMLElement;
+            if (!element) return null;
+
+            // If it's not an iframe, return the element as is
+            if (element.tagName !== 'IFRAME') return element;
+
+            // Initialize tracking variables for iframe traversal
+            let currentIframe = element as HTMLIFrameElement;
+            let deepestElement = element;
+
+            // Continue traversing while we have a valid iframe
+            while (currentIframe) {
+                try {
+                    // Convert the coordinates from main document space to iframe's local space
+                    const iframeRect = currentIframe.getBoundingClientRect();
+                    const iframeX = x - iframeRect.left;
+                    const iframeY = y - iframeRect.top;
+
+                    // Get the iframe's document object - this gives us access to the iframe's content
+                    const iframeDocument = currentIframe.contentDocument || currentIframe.contentWindow?.document;
+                    if (!iframeDocument) break;
+
+                    // Find the element at the transformed coordinates within the iframe
+                    const iframeElement = iframeDocument.elementFromPoint(iframeX, iframeY) as HTMLElement;
+
+                    // If no element found or it's the same as current, stop traversing
+                    if (!iframeElement) break;
+
+                    // Update our tracking of the deepest element
+                    deepestElement = iframeElement;
+
+                    // If we found another iframe, continue traversing through it
+                    if (iframeElement.tagName === 'IFRAME') {
+                        currentIframe = iframeElement as HTMLIFrameElement;
+                    } else {
+                        // If it's not an iframe, we've reached the deepest level
+                        break;
+                    }
+                } catch (error) {
+                    // Handle potential cross-origin security restrictions
+                    console.warn('Cannot access iframe content:', error);
+                    break;
+                }
+            }
+            return deepestElement;
+          };


🛠️ Refactor suggestion

Thorough iframe element-from-point logic.
This multi-level iframe traversal for getDeepestElementFromPoint is quite detailed. Ensure that repeated code remains in sync if similar functions exist. Consider extracting into a shared helper.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

server/src/workflow-management/selector.ts (1)

27-89: 🛠️ Refactor suggestion

Consolidate duplicate getDeepestElementFromPoint implementations.

The function is implemented multiple times with similar logic. As noted in a previous review, this should be consolidated into a shared helper to ensure the implementations remain in sync.

Consider moving this to a shared utility file that can be imported where needed.

Also applies to: 1193-1261

🧹 Nitpick comments (3)

src/components/organisms/BrowserWindow.tsx (2)

121-123: Remove debug logging statements.

These console.log statements should be removed before merging.

-  console.log("LIST SELECTOR", listSelector);
-  console.log("DATA SELECTOR", data.selector);
-  console.log("CHILD SELECTORS", data.childSelectors);

141-161: Consider refactoring the selector validation logic.

The nested conditions for handling different selector types (iframe, shadow DOM, mixed) make the code hard to follow. Consider extracting these checks into separate helper functions.

+ const isValidIframeChild = (selector: string, childSelectors: string[]) => {
+   return childSelectors.some(childSelector => 
+     selector.includes(':>>') && 
+     childSelector.split(':>>').some(part => 
+       selector.includes(part.trim())
+     )
+   );
+ };

+ const isValidMixedSelector = (selector: string, childSelectors: string[]) => {
+   const selectorParts = selector.split(':>>').map(part => part.trim());
+   return selectorParts.some(part => 
+     childSelectors.some(childSelector => 
+       childSelector.includes(part)
+     )
+   );
+ };

  if (data.elementInfo?.isIframeContent && data.childSelectors) {
-   const isIframeChild = data.childSelectors.some(childSelector => 
-     data.selector.includes(':>>') &&
-     childSelector.split(':>>').some(part => 
-       data.selector.includes(part.trim())
-     )
-   );
+   const isIframeChild = isValidIframeChild(data.selector, data.childSelectors);
    setHighlighterData(isIframeChild ? data : null);
  } else if (data.selector.includes(':>>') && hasValidChildSelectors) {
-   const selectorParts = data.selector.split(':>>').map(part => part.trim());
-   const isValidMixedSelector = selectorParts.some(part => 
-     data.childSelectors!.some(childSelector => 
-       childSelector.includes(part)
-     )
-   );
+   const isValid = isValidMixedSelector(data.selector, data.childSelectors!);
-   setHighlighterData(isValidMixedSelector ? data : null);
+   setHighlighterData(isValid ? data : null);
  }

maxun-core/src/browserSide/scraper.js (1)

208-273: Consider caching DOM queries and adding early exits.

The function performs well but could be optimized:

Cache querySelector results when reused

Add early exits for empty results

Consider using WeakMap for caching across multiple calls
+ const elementCache = new WeakMap();
+ const getCachedElements = (root, selector) => {
+   if (!elementCache.has(root)) {
+     elementCache.set(root, new Map());
+   }
+   const cache = elementCache.get(root);
+   if (!cache.has(selector)) {
+     cache.set(selector, Array.from(root.querySelectorAll(selector)));
+   }
+   return cache.get(selector);
+ };

  if (!config.selector.includes('>>') && !config.selector.includes(':>>')) {
-   return Array.from(document.querySelectorAll(config.selector));
+   return getCachedElements(document, config.selector);
  }
🧰 Tools

🪛 Biome (1.9.4)

[error] 264-264: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e775fa and d1e8239.

📒 Files selected for processing (6)

maxun-core/src/browserSide/scraper.js (4 hunks)
maxun-core/src/interpret.ts (1 hunks)
server/src/types/index.ts (2 hunks)
server/src/workflow-management/selector.ts (21 hunks)
server/src/workflow-management/utils.ts (1 hunks)
src/components/organisms/BrowserWindow.tsx (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

server/src/workflow-management/utils.ts
maxun-core/src/interpret.ts
server/src/types/index.ts

🧰 Additional context used

🪛 Biome (1.9.4)

src/components/organisms/BrowserWindow.tsx

[error] 138-138: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

maxun-core/src/browserSide/scraper.js

[error] 264-264: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 453-453: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 552-552: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

server/src/workflow-management/selector.ts

[error] 2071-2072: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

[error] 2154-2154: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

🔇 Additional comments (3)

src/components/organisms/BrowserWindow.tsx (1)

15-15: LGTM! Well-typed interface extension.

The addition of isIframeContent property to the ElementInfo interface properly supports iframe content detection.

maxun-core/src/browserSide/scraper.js (1)

275-297: LGTM! Robust URL resolution implementation.

The function properly handles URL resolution in iframe contexts and includes safe property access. The implementation is thorough and handles edge cases well.

server/src/workflow-management/selector.ts (1)

1267-1331: LGTM! Well-structured selector generation logic.

The implementation provides comprehensive handling of both iframe and shadow DOM contexts with:

Clear separation of concerns in helper functions

Proper error handling and logging

Good depth limiting to prevent infinite recursion

coderabbitai · 2025-01-04T18:37:53Z

maxun-core/src/browserSide/scraper.js

+    // Enhanced query function to handle both iframe and shadow DOM
+    const queryElement = (rootElement, selector) => {
+      if (!selector.includes('>>') && !selector.includes(':>>')) {
          return rootElement.querySelector(selector);
-        }
-
-        const parts = selector.split('>>').map(part => part.trim());
-        let currentElement = rootElement;
-
-        for (let i = 0; i < parts.length; i++) {
-            if (!currentElement) return null;
+      }

-            if (!currentElement.querySelector && !currentElement.shadowRoot) {
-                currentElement = document.querySelector(parts[i]);
-                continue;
-            }
+      const parts = selector.split(/(?:>>|:>>)/).map(part => part.trim());
+      let currentElement = rootElement;
+
+      for (let i = 0; i < parts.length; i++) {
+          if (!currentElement) return null;
+
+          // Handle iframe traversal
+          if (currentElement.tagName === 'IFRAME') {
+              try {
+                  const iframeDoc = currentElement.contentDocument || currentElement.contentWindow.document;
+                  currentElement = iframeDoc.querySelector(parts[i]);
+                  continue;
+              } catch (e) {
+                  console.warn('Cannot access iframe content:', e);
+                  return null;
+              }
+          }

-            let nextElement = currentElement.querySelector(parts[i]);
+          // Try regular DOM first
+          let nextElement = currentElement.querySelector(parts[i]);

-            if (!nextElement && currentElement.shadowRoot) {
-                nextElement = currentElement.shadowRoot.querySelector(parts[i]);
-            }
+          // Try shadow DOM if not found
+          if (!nextElement && currentElement.shadowRoot) {
+              nextElement = currentElement.shadowRoot.querySelector(parts[i]);
+          }

-            if (!nextElement) {
-                const allChildren = Array.from(currentElement.children || []);
-                for (const child of allChildren) {
-                    if (child.shadowRoot) {
-                        nextElement = child.shadowRoot.querySelector(parts[i]);
-                        if (nextElement) break;
-                    }
-                }
-            }
+          // Check children's shadow roots if still not found
+          if (!nextElement) {
+              const children = Array.from(currentElement.children || []);
+              for (const child of children) {
+                  if (child.shadowRoot) {
+                      nextElement = child.shadowRoot.querySelector(parts[i]);
+                      if (nextElement) break;
+                  }
+              }
+          }

-            currentElement = nextElement;
-        }
+          currentElement = nextElement;
+      }

-        return currentElement;
+      return currentElement;
    };

-    const queryShadowDOMAll = (rootElement, selector) => {
-        if (!selector.includes('>>')) {
+    // Enhanced query all function for both contexts
+    const queryElementAll = (rootElement, selector) => {
+      if (!selector.includes('>>') && !selector.includes(':>>')) {
          return rootElement.querySelectorAll(selector);
-        }
+      }

-        const parts = selector.split('>>').map(part => part.trim());
-        let currentElements = [rootElement];
-
-        for (const part of parts) {
-            const nextElements = [];
-
-            for (const element of currentElements) {
-                if (element.querySelectorAll) {
-                    nextElements.push(...element.querySelectorAll(part));
-                }
-
-                if (element.shadowRoot) {
-                    nextElements.push(...element.shadowRoot.querySelectorAll(part));
-                }
-
-                const children = Array.from(element.children || []);
-                for (const child of children) {
-                    if (child.shadowRoot) {
-                        nextElements.push(...child.shadowRoot.querySelectorAll(part));
-                    }
-                }
-            }
-
-            currentElements = nextElements;
-        }
-
-        return currentElements;
+      const parts = selector.split(/(?:>>|:>>)/).map(part => part.trim());
+      let currentElements = [rootElement];
+
+      for (const part of parts) {
+          const nextElements = [];
+
+          for (const element of currentElements) {
+              // Handle iframe traversal
+              if (element.tagName === 'IFRAME') {
+                  try {
+                      const iframeDoc = element.contentDocument || element.contentWindow.document;
+                      nextElements.push(...iframeDoc.querySelectorAll(part));
+                  } catch (e) {
+                      console.warn('Cannot access iframe content:', e);
+                      continue;
+                  }
+              } else {
+                  // Regular DOM elements
+                  if (element.querySelectorAll) {
+                      nextElements.push(...element.querySelectorAll(part));
+                  }
+
+                  // Shadow DOM elements
+                  if (element.shadowRoot) {
+                      nextElements.push(...element.shadowRoot.querySelectorAll(part));
+                  }
+
+                  // Check children's shadow roots
+                  const children = Array.from(element.children || []);
+                  for (const child of children) {
+                      if (child.shadowRoot) {
+                          nextElements.push(...child.shadowRoot.querySelectorAll(part));
+                      }
+                  }
+              }
+          }
+
+          currentElements = nextElements;
+      }
+
+      return currentElements;


🛠️ Refactor suggestion

Reduce code duplication in iframe traversal logic.

The iframe traversal logic is duplicated across multiple implementations of getDeepestElementFromPoint. Consider extracting this into a shared utility function.

+ const traverseIframe = (iframe, x, y, depth = 0) => { + const MAX_DEPTH = 4; + if (depth >= MAX_DEPTH) return null; + + try { + const rect = iframe.getBoundingClientRect(); + const localX = x - rect.left; + const localY = y - rect.top; + + const doc = iframe.contentDocument || iframe.contentWindow?.document; + if (!doc) return null; + + return doc.elementFromPoint(localX, localY); + } catch (error) { + console.warn('Cannot access iframe content:', error); + return null; + } + }; const queryElement = (rootElement, selector) => { if (!selector.includes('>>') && !selector.includes(':>>')) { return rootElement.querySelector(selector); } // ... existing code ... if (currentElement.tagName === 'IFRAME') { - try { - const iframeDoc = currentElement.contentDocument || currentElement.contentWindow.document; - currentElement = iframeDoc.querySelector(parts[i]); - continue; - } catch (e) { - console.warn('Cannot access iframe content:', e); - return null; - } + const iframeElement = traverseIframe(currentElement, x, y); + if (!iframeElement) return null; + currentElement = iframeElement; + continue; } };

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Biome (1.9.4)

[error] 453-453: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

feat: iframe support for get element info

0d763f7

amhsirak marked this pull request as draft December 20, 2024 13:02

amhsirak added the Type: Feature New features label Dec 20, 2024

amhsirak added the Status: Work In Progess This issue/PR is actively being worked on label Dec 20, 2024

amhsirak and others added 15 commits December 20, 2024 20:28

feat: iframe support for getRect

6904933

chore: fix format

8ba928d

feat: add iframeSelector generation logic for capture text

b6faf5c

feat: add nested iframe selector generation func for capture text

3f73a48

feat: add iframe selector type for selectors

9eb4ec3

feat: prioritize returning iframe selector if exists

2943681

feat: get deeply nested iframe element info

360fe63

feat: improve rect generation for iframe elements

0eb9a8f

feat: add iframe support for scrapeSchema action

a845509

feat: improve non unique selector generation for capture list

7018ba6

feat: add iframe support for child selector generation

ed1ea41

feat: add iframe support for highlighter logic

1c13a23

feat: add iframe support for table and non table scraping

96049e3

feat: add func to rm iframe selector from workflow

7b08471

Merge branch 'develop' into iframe

5e775fa

amhsirak marked this pull request as ready for review January 4, 2025 16:38

amhsirak commented Jan 4, 2025

View reviewed changes

coderabbitai bot reviewed Jan 4, 2025

View reviewed changes

Merge branch 'develop' into iframe

d1e8239

coderabbitai bot reviewed Jan 4, 2025

View reviewed changes

RohitR311 merged commit c132f27 into develop Jan 4, 2025
1 check passed

coderabbitai bot mentioned this pull request Jan 4, 2025

feat: compatibility check for shadowDOM and iframe scraping #320

Merged

coderabbitai bot mentioned this pull request Jan 28, 2025

feat: better IFrame and shadowDom elements scraping #409

Closed

coderabbitai bot mentioned this pull request Mar 4, 2025

feat: improve highlighting and data extraction #462

Merged

coderabbitai bot mentioned this pull request Jul 6, 2025

feat: better, faster, smarter capture list #675

Merged

coderabbitai bot mentioned this pull request Jul 15, 2025

feat(maxun-core): child extraction + deep shadow dom #694

Merged

coderabbitai bot mentioned this pull request Jul 28, 2025

feat: auto add capture fields for list selection #710

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: support `iframe` extraction #275

feat: support `iframe` extraction #275

Uh oh!

amhsirak commented Dec 20, 2024 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 20, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

amhsirak left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 4, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: support iframe extraction #275

feat: support iframe extraction #275

Uh oh!

Conversation

amhsirak commented Dec 20, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Possibly related PRs

Suggested Labels

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

amhsirak left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 4, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: support `iframe` extraction #275

feat: support `iframe` extraction #275

amhsirak commented Dec 20, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 20, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)