-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat(maxun-core): child extraction + deep shadow dom #694
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change introduces comprehensive support for shadow DOM in selector generation, DOM highlighting, and data extraction. It adds shadow DOM awareness to state, props, and internal logic across browser window, renderer, context, and helper modules. Selector generation, caching, and highlighter logic are refactored to robustly handle shadow roots, similar elements, and optimized XPath evaluation, with UI and control flow enhancements for improved user feedback. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant BrowserWindow
participant DOMBrowserRenderer
participant ClientSelectorGenerator
participant ClientListExtractor
User->>BrowserWindow: Interacts with DOM (hover/select)
BrowserWindow->>DOMBrowserRenderer: Passes selection/highlight event
DOMBrowserRenderer->>ClientSelectorGenerator: generateDataForHighlighter(coordinates, doc, ...)
ClientSelectorGenerator->>ClientSelectorGenerator: Traverse DOM (including shadow roots)
ClientSelectorGenerator->>BrowserWindow: Return selector, isShadow, similarElements
BrowserWindow->>BrowserWindow: Update highlighterData state
BrowserWindow->>DOMBrowserRenderer: Render highlights (with shadow DOM and similar elements)
User->>BrowserWindow: Confirms selection
BrowserWindow->>ClientListExtractor: extractListData(doc, selector, fields, isShadow)
ClientListExtractor->>ClientListExtractor: Evaluate XPath stepwise (shadow DOM aware)
ClientListExtractor->>BrowserWindow: Return extracted data
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🔭 Outside diff range comments (1)
src/components/browser/BrowserWindow.tsx (1)
723-735: Remove duplicate useEffect hook.This useEffect is an exact duplicate of the one at lines 708-721. This will cause event listeners to be registered twice, potentially leading to performance issues and unexpected behavior.
Remove this entire duplicate useEffect block.
🧹 Nitpick comments (10)
src/helpers/clientListExtractor.ts (1)
439-544: Comprehensive shadow DOM XPath evaluation implementationThe new
evaluateXPathSinglemethod provides robust shadow DOM support with proper handling of:
- Indexed XPath expressions
- Step-by-step traversal through shadow boundaries
- Correct positional selector handling (1-based to 0-based conversion)
However, due to the complexity of this implementation, consider adding performance monitoring to ensure the step-by-step evaluation doesn't impact extraction performance on pages with deep shadow DOM trees.
Consider adding detailed JSDoc documentation for this method to explain the shadow DOM traversal algorithm, especially the handling of indexed paths and context switching between Document, Element, and ShadowRoot.
src/helpers/clientSelectorGenerator.ts (9)
132-132: Remove unuseddebounceMsproperty.The
debounceMsproperty in the performance configuration is defined but never used in the code.Remove the unused property:
private performanceConfig = { enableSpatialIndexing: true, maxSelectorBatchSize: 50, useElementCache: true, - debounceMs: 16, // ~60fps };
183-227: Refactor duplicated custom element check.The custom element check
tagName.includes("-")is duplicated and the exclusion logic can be simplified.Extract the custom element check and simplify the logic:
const tagName = element.tagName.toLowerCase(); - -const isCustomElement = tagName.includes("-"); - -const standardExcludeSelectors = [ - "script", - "style", - "meta", - "link", - "title", - "head", -]; -if (!isCustomElement && standardExcludeSelectors.includes(tagName)) { - return null; -} - +const isCustomElement = tagName.includes("-"); +const isExcludedTag = !isCustomElement && this.groupingConfig.excludeSelectors.includes(tagName); + +if (isExcludedTag) return null; -if (this.groupingConfig.excludeSelectors.includes(tagName)) return null; // ... rest of the method ... const relevantAttributes = Array.from(element.attributes) .filter((attr) => { - if (isCustomElement) { - return ![ - "id", - "style", - "data-reactid", - "data-react-checksum", - ].includes(attr.name.toLowerCase()); - } else { - return ( - !["id", "style", "data-reactid", "data-react-checksum"].includes( - attr.name.toLowerCase() - ) && - (!attr.name.startsWith("data-") || - attr.name === "data-type" || - attr.name === "data-role") - ); - } + const excludedAttrs = ["id", "style", "data-reactid", "data-react-checksum"]; + if (excludedAttrs.includes(attr.name.toLowerCase())) return false; + + if (!isCustomElement && attr.name.startsWith("data-") && + attr.name !== "data-type" && attr.name !== "data-role") { + return false; + } + + return true; })
333-366: Consider performance optimizations for large DOM trees.The
getAllVisibleElementsWithShadowmethod performs visibility checks on every element, which could be expensive for large DOM trees.Consider adding a depth limit and lazy evaluation:
-private getAllVisibleElementsWithShadow(doc: Document): HTMLElement[] { +private getAllVisibleElementsWithShadow(doc: Document, maxDepth: number = 10): HTMLElement[] { const allElements: HTMLElement[] = []; const visited = new Set<HTMLElement>(); - const traverseContainer = (container: Document | ShadowRoot) => { + const traverseContainer = (container: Document | ShadowRoot, depth: number = 0) => { + if (depth > maxDepth) { + console.warn(`Max shadow DOM depth ${maxDepth} reached`); + return; + } + try { const elements = Array.from(container.querySelectorAll("*")).filter( (el) => { const rect = el.getBoundingClientRect(); return rect.width > 0 && rect.height > 0; // Only visible elements } ) as HTMLElement[]; elements.forEach((element) => { if (!visited.has(element)) { visited.add(element); allElements.push(element); // Traverse shadow DOM if it exists if (element.shadowRoot) { - traverseContainer(element.shadowRoot); + traverseContainer(element.shadowRoot, depth + 1); } } }); } catch (error) { console.warn(`⚠️ Error traversing container:`, error); } }; // Start from main document - traverseContainer(doc); + traverseContainer(doc, 0); return allElements; }
2821-2824: Add more context to error logging.The error logging lacks context about which selector caused the error.
} catch (error) { - console.error("Error in getChildSelectors:", error); + console.error("Error in getChildSelectors for parent selector:", parentSelector, error); return []; }
3214-3214: Make performance timing conditional on debug mode.The console.time/timeEnd calls should be conditional to avoid console pollution in production.
Consider adding a debug flag:
+private debugMode = false; + // In precomputeSelectorMappings method: -console.time("Precomputing selector mappings"); +if (this.debugMode) console.time("Precomputing selector mappings"); // ... rest of the method ... -console.timeEnd("Precomputing selector mappings"); +if (this.debugMode) console.timeEnd("Precomputing selector mappings");Also applies to: 3252-3252
3323-3326: Use optional chaining for cleaner code.The null check can be simplified using optional chaining.
-const cachedElements = this.selectorElementCache.get(selector); -if (cachedElements && cachedElements.includes(targetElement)) { +const cachedElements = this.selectorElementCache.get(selector); +if (cachedElements?.includes(targetElement)) { matches.push(selector); }
3327-3330: Remove unnecessary continue statement.The continue statement at the end of a loop iteration is redundant.
} catch (error) { - continue; + // Skip invalid selectors }
3405-3408: Remove unnecessary continue statement.The continue statement at the end of a loop iteration is redundant.
} catch (error) { - continue; + // Skip invalid selectors }
1-4233: Consider splitting this large file into smaller, focused modules.The
ClientSelectorGeneratorclass has grown to over 4000 lines and handles many responsibilities including DOM traversal, XPath generation, caching, spatial indexing, and shadow DOM handling. This violates the Single Responsibility Principle.Consider refactoring into separate modules:
ShadowDOMTraverser- Shadow DOM traversal logicXPathGenerator- XPath generation and optimizationSelectorCache- Caching and spatial indexing logicElementMatcher- Element similarity and matching logicGroupAnalyzer- Element grouping logicThis would improve maintainability, testability, and make the codebase easier to understand.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/components/browser/BrowserWindow.tsx(18 hunks)src/components/recorder/DOMBrowserRenderer.tsx(8 hunks)src/components/recorder/RightSidePanel.tsx(1 hunks)src/context/browserSteps.tsx(6 hunks)src/helpers/clientListExtractor.ts(6 hunks)src/helpers/clientSelectorGenerator.ts(25 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
src/helpers/clientSelectorGenerator.ts
[error] 3324-3325: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 3330-3332: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 3408-3410: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
🔇 Additional comments (11)
src/components/recorder/RightSidePanel.tsx (1)
463-498: Consistent shadow DOM metadata propagationThe changes properly extend the list settings object to include the
isShadowboolean property at all relevant levels (fields, pagination, and settings). This ensures shadow DOM context is preserved throughout the list extraction process.src/context/browserSteps.tsx (1)
8-8: Well-structured shadow DOM support in browser stepsThe changes consistently implement the
isShadowproperty across all relevant interfaces and maintain proper state management. The rename fromshadowtoisShadowimproves clarity, and the logic correctly preserves existing values when updating steps.Also applies to: 25-30, 42-42, 74-82, 116-116, 138-138
src/components/recorder/DOMBrowserRenderer.tsx (2)
105-105: Proper shadow DOM support and caching mode integrationThe implementation correctly adds
isCachingChildSelectorsprop to prevent rendering during child selector caching operations. Shadow DOM metadata (isShadowandsimilarElements) is properly propagated through highlight and selection callbacks.Also applies to: 158-158, 249-250, 304-307, 411-411, 760-760
1087-1087: Cursor style remains enforced by global ruleThe removal of
!importanton the inline style at line 1087 does not weaken the pointer cursor: the surrounding CSS block in the same file still applies/* src/components/recorder/DOMBrowserRenderer.tsx:807–809 */ * { cursor: "pointer" !important; }This global rule ensures the pointer cursor is always shown, and there are no other conflicting overrides.
src/helpers/clientListExtractor.ts (2)
9-9: Consistent property naming for shadow DOM contextThe rename from
shadowtoisShadowmaintains consistency across the codebase and improves clarity.Also applies to: 22-22, 329-329
546-672: Well-structured XPath condition matchingThe helper methods
queryInsideContext,parseXPathPart, andelementMatchesConditionprovide comprehensive support for various XPath conditions including attributes, classes, text content, and element counts. The implementation correctly handles edge cases and provides good debugging output.src/components/browser/BrowserWindow.tsx (5)
150-161: LGTM! Well-structured state extension for shadow DOM support.The addition of optional properties for shadow DOM awareness and similar elements is properly typed and maintains backward compatibility.
1200-1257: Modal restructuring looks good.The attribute selection modal has been restructured with improved layout and custom styling. The implementation maintains functionality while providing better visual organization.
1519-1547: Well-implemented loading overlay.The loading overlay provides clear visual feedback during child selector caching and properly prevents user interaction. The styling is consistent with the application theme.
514-529: Excellent implementation of similar elements highlighting.The coordinate mapping and rendering logic for similar elements is well-implemented. The visual differentiation with dashed borders and labels provides clear user feedback.
Also applies to: 1399-1445
175-184: Ignore unusedsetPendingNotificationwarningThe
setPendingNotificationsetter is used later in this file (e.g. it’s called to clear the notification around line 348). The original comment’s claim that it’s never called is therefore incorrect—no further action is needed here.Likely an incorrect or invalid review comment.
| <style>{` | ||
| @keyframes spin { | ||
| 0% { transform: rotate(0deg); } | ||
| 100% { transform: rotate(360deg); } | ||
| } | ||
| `}</style> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate CSS animation.
This CSS keyframe animation is already defined at lines 1292-1297. Having duplicate styles increases bundle size unnecessarily.
Remove this duplicate style block.
🤖 Prompt for AI Agents
In src/components/browser/BrowserWindow.tsx between lines 1509 and 1514, there
is a duplicate CSS keyframe animation named "spin" that is already defined at
lines 1292-1297. To reduce bundle size and avoid redundancy, remove this
duplicate style block entirely from lines 1509 to 1514.
| isShadow: highlighterData.elementInfo?.isShadowRoot, | ||
| attribute, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Standardize shadow DOM flag usage.
The code inconsistently uses elementInfo?.isShadowRoot and highlighterData?.isShadow to determine if an element is in shadow DOM. This could lead to bugs if these values differ.
Standardize on using highlighterData?.isShadow as the source of truth:
- isShadow: highlighterData.elementInfo?.isShadowRoot,
+ isShadow: highlighterData.isShadow,Apply this change to all occurrences for consistency.
Also applies to: 908-908, 987-987, 1081-1081, 1135-1135, 1148-1148
🤖 Prompt for AI Agents
In src/components/browser/BrowserWindow.tsx at lines 853-854 and also at lines
908, 987, 1081, 1135, and 1148, replace all uses of elementInfo?.isShadowRoot
with highlighterData?.isShadow to standardize the shadow DOM flag usage. This
means wherever isShadowRoot is used to determine shadow DOM presence, update it
to use highlighterData?.isShadow instead, ensuring consistent and reliable
checks throughout the file.
| setTimeout(() => { | ||
| try { | ||
| const childSelectors = | ||
| clientSelectorGenerator.getChildSelectors( | ||
| iframeElement.contentDocument as Document, | ||
| listSelector | ||
| ); | ||
|
|
||
| clientSelectorGenerator.precomputeChildSelectorMappings( | ||
| childSelectors, | ||
| iframeElement.contentDocument as Document | ||
| ); | ||
|
|
||
| setCachedChildSelectors([]); | ||
| setCachedChildSelectors(childSelectors); | ||
| } catch (error) { | ||
| console.error("Error during child selector caching:", error); | ||
| } finally { | ||
| setIsCachingChildSelectors(false); | ||
|
|
||
| if (currentSnapshot) { | ||
| const iframeElement = document.querySelector( | ||
| "#dom-browser-iframe" | ||
| ) as HTMLIFrameElement; | ||
| if (iframeElement?.contentDocument) { | ||
| const childSelectors = clientSelectorGenerator.getChildSelectors( | ||
| iframeElement.contentDocument, | ||
| listSelector | ||
| ); | ||
| setCachedChildSelectors(childSelectors); | ||
| if (pendingNotification) { | ||
| notify(pendingNotification.type, pendingNotification.message); | ||
| setPendingNotification(null); | ||
| } | ||
| } | ||
| } | ||
| }, 100); | ||
| } else { | ||
| setIsCachingChildSelectors(false); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve error handling and remove arbitrary timeout.
The child selector caching implementation has several issues:
- The 100ms timeout seems arbitrary and could cause race conditions
- Errors are only logged to console without notifying the user
- The pendingNotification is never set, making lines 347-350 dead code
Consider this improved implementation:
- setTimeout(() => {
try {
const childSelectors =
clientSelectorGenerator.getChildSelectors(
iframeElement.contentDocument as Document,
listSelector
);
clientSelectorGenerator.precomputeChildSelectorMappings(
childSelectors,
iframeElement.contentDocument as Document
);
setCachedChildSelectors(childSelectors);
} catch (error) {
console.error("Error during child selector caching:", error);
+ notify("error", "Failed to cache child selectors. Please try again.");
} finally {
setIsCachingChildSelectors(false);
-
- if (pendingNotification) {
- notify(pendingNotification.type, pendingNotification.message);
- setPendingNotification(null);
- }
}
- }, 100);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| setTimeout(() => { | |
| try { | |
| const childSelectors = | |
| clientSelectorGenerator.getChildSelectors( | |
| iframeElement.contentDocument as Document, | |
| listSelector | |
| ); | |
| clientSelectorGenerator.precomputeChildSelectorMappings( | |
| childSelectors, | |
| iframeElement.contentDocument as Document | |
| ); | |
| setCachedChildSelectors([]); | |
| setCachedChildSelectors(childSelectors); | |
| } catch (error) { | |
| console.error("Error during child selector caching:", error); | |
| } finally { | |
| setIsCachingChildSelectors(false); | |
| if (currentSnapshot) { | |
| const iframeElement = document.querySelector( | |
| "#dom-browser-iframe" | |
| ) as HTMLIFrameElement; | |
| if (iframeElement?.contentDocument) { | |
| const childSelectors = clientSelectorGenerator.getChildSelectors( | |
| iframeElement.contentDocument, | |
| listSelector | |
| ); | |
| setCachedChildSelectors(childSelectors); | |
| if (pendingNotification) { | |
| notify(pendingNotification.type, pendingNotification.message); | |
| setPendingNotification(null); | |
| } | |
| } | |
| } | |
| }, 100); | |
| } else { | |
| setIsCachingChildSelectors(false); | |
| } | |
| } | |
| try { | |
| const childSelectors = | |
| clientSelectorGenerator.getChildSelectors( | |
| iframeElement.contentDocument as Document, | |
| listSelector | |
| ); | |
| clientSelectorGenerator.precomputeChildSelectorMappings( | |
| childSelectors, | |
| iframeElement.contentDocument as Document | |
| ); | |
| setCachedChildSelectors(childSelectors); | |
| } catch (error) { | |
| console.error("Error during child selector caching:", error); | |
| notify("error", "Failed to cache child selectors. Please try again."); | |
| } finally { | |
| setIsCachingChildSelectors(false); | |
| } |
🤖 Prompt for AI Agents
In src/components/browser/BrowserWindow.tsx lines 328 to 356, remove the
arbitrary 100ms setTimeout to avoid race conditions by executing the child
selector caching logic synchronously or using proper async handling. Enhance
error handling by notifying the user of errors instead of only logging to the
console. Also, remove or properly implement the pendingNotification logic since
it is currently never set and results in dead code.
| } catch (error) { | ||
| console.error("Error getting matching elements:", error); | ||
| return []; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve error handling to distinguish between errors and empty results.
The catch block returns an empty array for all errors, making it difficult to distinguish between legitimate empty results and actual errors.
Consider logging the error type and potentially re-throwing for critical errors:
} catch (error) {
- console.error("Error getting matching elements:", error);
+ if (error instanceof SyntaxError || error instanceof DOMException) {
+ console.error("Invalid selector in getAllMatchingElements:", hoveredSelector, error);
+ } else {
+ console.error("Unexpected error in getAllMatchingElements:", error);
+ }
return [];
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } catch (error) { | |
| console.error("Error getting matching elements:", error); | |
| return []; | |
| } | |
| } catch (error) { | |
| if (error instanceof SyntaxError || error instanceof DOMException) { | |
| console.error("Invalid selector in getAllMatchingElements:", hoveredSelector, error); | |
| } else { | |
| console.error("Unexpected error in getAllMatchingElements:", error); | |
| } | |
| return []; | |
| } |
🤖 Prompt for AI Agents
In src/helpers/clientSelectorGenerator.ts around lines 594 to 597, the catch
block currently returns an empty array for all errors, which conflates actual
errors with legitimate empty results. Modify the error handling to log the error
type and message clearly, and re-throw critical errors instead of returning an
empty array. For non-critical errors, you can return an empty array but ensure
the distinction is clear in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
maxun-core/src/browserSide/scraper.js (1)
538-637: Consider performance implications of custom XPath evaluation.The enhanced
evaluateXPathfunction adds valuable shadow DOM support, but the custom implementation could be significantly slower than native XPath evaluation, especially for complex queries or large DOM trees.Consider these optimizations:
- Caching parsed XPath expressions to avoid repeated parsing:
+const xpathCache = new Map(); +const getCachedXPath = (xpath) => { + if (xpathCache.has(xpath)) { + return xpathCache.get(xpath); + } + const parsed = parseXPathExpression(xpath); + xpathCache.set(xpath, parsed); + return parsed; +};
- Early exit conditions for common cases:
+// Quick check for simple tag-only selectors +if (!xpath.includes('[') && !xpath.includes('(')) { + return document.querySelector(xpath.replace('//', '').replace('/', ' ')); +}
- Consider breaking this function into smaller, focused functions for better maintainability and testing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js(2 hunks)
🔇 Additional comments (5)
maxun-core/src/browserSide/scraper.js (5)
427-445: LGTM: Well-structured XPath query helper function.The
queryInsideContextfunction provides a clean interface for querying elements within a specific context using parsed XPath parts. The error handling and filtering logic are appropriate.
447-458: LGTM: Robust XPath parsing implementation.The
parseXPathPartfunction correctly extracts tag names and conditions from XPath expressions using appropriate regex patterns. The default handling for wildcards and condition parsing is well-implemented.
460-468: LGTM: Simple and effective condition matching.The
elementMatchesConditionsfunction provides a clear interface for checking multiple conditions against an element. The early return pattern is efficient.
470-536: Comprehensive XPath condition handling with good coverage.The
elementMatchesConditionfunction handles a wide variety of XPath conditions including attributes, text content, class containment, and child counts. The regex patterns are well-crafted and the logic is sound.
1209-1209: Shadow DOM flag propagation verifiedThe
isShadowflag is correctly defined on selector objects (e.g. in ClientListExtractor and clientSelectorGenerator), carried through the UI layers (BrowserStepsContext → RightSidePanel → BrowserWindow), and ultimately passed into the scraper’sfieldsconfig and consumed byevaluateXPath. No further changes are needed.
Summary by CodeRabbit
New Features
Improvements
Bug Fixes