-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: improve highlighting and data extraction #462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes update the scraping and pagination logic across multiple modules. The scraper now handles both iframes and frames more robustly by detecting frame name attributes and improving element data extraction. The pagination logic in the interpret module gains a retry mechanism with updated selectors for button interactions. In the server modules, the selector generation and frame handling are extended with new helper functions and type updates to process nested frames and shadow DOM elements. Changes
Sequence Diagram(s)sequenceDiagram
participant U as User
participant P as PaginationHandler
participant F as findWorkingButton
participant C as ButtonClick
U->>P: Trigger pagination (clickNext/clickLoadMore)
P->>F: Attempt to find working button with selectors
F-->>P: Return {button, workingSelector, updatedSelectors}
P->>C: Click found button
C-->>P: Report new item count
alt No new items found
P->>P: Retry (up to MAX_RETRIES)
end
P-->>U: Complete pagination process
sequenceDiagram
participant S as Scraper
participant F as FrameHandler
participant E as ElementSelector
S->>F: Check if selector starts with "frame[name=" or "iframe[name="
F->>E: Query using getElementsByName and querySelectorAll
E-->>F: Return matching frame/iframe elements
F-->>S: Provide frame elements for processing
S->>S: Extract element data (innerHTML/outerHTML)
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (10)
maxun-core/src/browserSide/scraper.js (2)
312-318: Refine base URL extraction logic.The current code gracefully handles exceptions and applies fallback logic for
baseURL. Consider using an optional chaining approach outside of the try/catch, if feasible, to simplify the logic and reduce reliance on exception handling.
672-674: Check contentDocument consistently.Fetching elements from
frame.contentDocumentcan fail under cross-origin restrictions or if the frame is not fully loaded. In these blocks, consider guardingframeDocwith additional null checks or a try/catch to ensure the code remains stable.Also applies to: 678-679
maxun-core/src/interpret.ts (3)
709-709: Remove superfluous spacing or unnecessary line.There's a new blank or unintentional line. Consider removing to maintain code cleanliness and consistency.
- +
805-808: Track selector changes across loop iterations.Reassigning
updatedSelectorstoavailableSelectorshelps keep state in sync across pagination steps. Ensure concurrency safety if multiple parallel tasks manipulate the sameavailableSelectorsvariable, or confirm that each concurrent path uses its own copy.
839-839: Clarify log message wording.The debug log for load-more clicks is helpful. Optionally include details like the total item count after each click to provide extra context for diagnosing pagination issues.
server/src/workflow-management/selector.ts (5)
36-37: Consider refactoring repeated shadow DOM depth handling.
These lines introduce or repeat adepthandMAX_SHADOW_DEPTHpattern for shadow DOM traversal. Re-declaring them across multiple sections can lead to scattered logic and potential inconsistencies.You could extract a helper function to handle shadow DOM traversal up to a configurable maximum depth, reducing duplication and improving readability.
Also applies to: 65-66, 136-137, 316-317, 586-587, 805-806, 1516-1517, 2333-2334
52-61: Streamline theisInFramesetchecks with optional chaining.
These blocks introduce a loop checkingnode && node.parentElement. Static analysis suggests using optional chains (e.g.while (node?.parentElement)) to reduce verbosity. Additionally, the same logic appears repeatedly in the codebase—consider centralizing theisInFramesetfunction to promote DRY (Don’t Repeat Yourself).- while (node && node.parentElement) { + while (node?.parentElement) { if (node.tagName === 'FRAMESET' || node.tagName === 'FRAME') { return true; } node = node.parentElement; }Also applies to: 303-312, 601-610, 792-801, 1503-1512, 1946-1955, 2320-2329
🧰 Tools
🪛 Biome (1.9.4)
[error] 54-54: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
94-146: Reduce duplication in frame handling blocks & consider removing unneededcontinue.
Blocks handling<FRAME>/<FRAMESET>appear multiple times with similar traversal logic (buildingframesToCheck, usingprocessFrames, etc.). Consider extracting a common helper to handle frames consistently and reduce repeated code.
Additionally, each loop contains acontinue;statement flagged by static analysis as possibly unnecessary—verify if skipping is needed or if simply proceeding to the next iteration is naturally handled.- console.warn('Cannot access frame content:', error); - continue; + console.warn('Cannot access frame content:', error); // Omit ‘continue;’ unless truly needed to skip extra logicAlso applies to: 344-396, 642-694, 833-886, 1544-1597, 1887-1939, 1987-2040, 2361-2414, 2748-2766
🧰 Tools
🪛 Biome (1.9.4)
[error] 141-141: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
452-513: Parallel addition of frame properties—avoid potential duplication.
This block mirrors the logic in lines 163-224, addingisFrameContent,frameURL, andframeIndexagain. Confirm you truly need to re-declare or re-assign them within this distinct path, or see if you can unify shared logic to keep code simpler.
968-968: Coalesce iframe coordinate adjustments under one shared approach.
Lines usingcurrentWindow?.frameElementand repeated logic to offset the bounding rect for frame/iframe elements occur in multiple places. A small helper performing these coordinate offsets for nested frames would respond well to future expansions (e.g., deeper nesting, exotic frame usage).Also applies to: 736-739, 1513-1517
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
maxun-core/src/browserSide/scraper.js(13 hunks)maxun-core/src/interpret.ts(6 hunks)server/src/workflow-management/classes/Generator.ts(1 hunks)server/src/workflow-management/selector.ts(38 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js
[error] 230-230: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
server/src/workflow-management/selector.ts
[error] 54-54: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 141-141: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 304-304: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 391-391: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 602-602: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 689-689: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 793-793: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 880-880: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 1504-1505: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 1592-1592: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 1947-1948: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 2035-2035: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2321-2322: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 2409-2409: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2748-2749: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2765-2766: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2884-2884: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
🔇 Additional comments (8)
server/src/workflow-management/classes/Generator.ts (1)
828-828: Add corresponding tests forrelSelectorusage.This entry extends the fallback chain with
relSelector. Ensure the new selector type is accurately tested and verified to work under real scenarios to maintain consistent selector generation.maxun-core/src/browserSide/scraper.js (2)
228-252: Ensure comprehensive frame handling validation.This new block decisively handles elements in frames by name attribute, falling back to direct selectors if needed. While the overall approach looks robust, confirm that cross-origin frames or unusual document states do not cause unexpected errors. Incorporating additional tests or error handling for null or undefined
docreferences may improve resilience.🧰 Tools
🪛 Biome (1.9.4)
[error] 230-230: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
570-574: Prevent potential null dereferences after accessing frame content.When transitioning into iframe/frame contexts (
currentElement.contentDocument.body), if that body is null or unavailable, the function immediately returnsnull. Verify that this short-circuit does not mask partial data or hamper further processing; you might want to log the frame’s origin or name for easier debugging.maxun-core/src/interpret.ts (2)
290-294: Approve flexible empty-array matching.Allowing an empty array for
urlorselectorsto satisfy the condition can be versatile. Confirm that this new logic does not enable unintended fall-through matches in certain workflow states.
600-647: Evaluate extended retry mechanism for button selection.The new
findWorkingButtonfunction introduces robust retries and updated selectors upon repeated failures. While beneficial, confirm that frequent attempts and 10-second timeouts do not unduly inflate runtimes, especially if many selectors are invalid. Additionally, consider logging partial progress or providing feedback to ensure debugging clarity when multiple selectors are discarded.server/src/workflow-management/selector.ts (3)
163-224: AddingisFrameContentand frame-related properties looks good—verify usage.
You’ve introducedisFrameContent,frameURL, andframeIndexalongsideisIframeContent, which can help disambiguate frames vs. iframes. Ensure that all consumers of the returnedelementInfoproperly handle these new properties and that you have test coverage for cases involving<FRAME>.
534-541: Expand test coverage for<SELECT>form fields.
This section handles retrieving text and selected values for<SELECT>elements (and potential date/time inputs). Ensure you have integration tests confirming thatinnerTextand theselectedValueattribute are captured correctly, including error cases (no option selected, empty selects, etc.).Also applies to: 252-254, 535-537
1909-1911: Good addition of'frame'to the DOMContext type.
Broadening the union type to include'frame'clarifies the container handling pattern. This better aligns with code that checksFRAMEvs.IFRAME.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
server/src/workflow-management/selector.ts (5)
52-61: Implement optional chaining for safer node traversal.The
isInFramesetfunction accessesnode.parentElementwithout checking ifnodemight be null, which could lead to potential null pointer errors.const isInFrameset = () => { let node = element; - while (node && node.parentElement) { + while (node?.parentElement) { if (node.tagName === 'FRAMESET' || node.tagName === 'FRAME') { return true; } node = node.parentElement; } return false; };🧰 Tools
🪛 Biome (1.9.4)
[error] 54-54: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
436-436: Fix potential division by zero error.When calculating
significantOverlap, if the parent element has zero width or height, this will result in a division by zero.const significantOverlap = (childRect.width * childRect.height) / - (parentRect.width * parentRect.height) > 0.1; + (parentRect.width * parentRect.height || 1) > 0.1;Also applies to: 928-928
1550-1602: Ensure adequate error handling in frame traversal.The frame traversal logic has appropriate error handling with try/catch blocks, but some areas could be improved.
Remove unnecessary
continuestatements at the end of catch blocks, as they're redundant in a for loop:try { // Frame access code } catch (error) { console.warn('Cannot access frame content:', error); - continue; }🧰 Tools
🪛 Biome (1.9.4)
[error] 1598-1598: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
1759-1769: Add type safety to frame detection.The frame detection could be improved with stronger type checking.
const isInFrameset = () => { let doc = element.ownerDocument; - return doc.querySelectorAll('frameset').length > 0; + return doc && doc.querySelectorAll('frameset').length > 0; };
2747-2776: Consolidate frame access error handling.The error handling for frame access is duplicated in multiple places. Consider refactoring to a common helper function.
+ function safelyAccessFrameDocument(frame: HTMLFrameElement): Document | null { + try { + return frame.contentDocument || frame.contentWindow?.document || null; + } catch (error) { + console.warn('Cannot access frame content:', error); + return null; + } + } // Then replace frame access code blocks with: const frameDoc = safelyAccessFrameDocument(frame); if (frameDoc) { const frameElements = Array.from(frameDoc.querySelectorAll('*')) as HTMLElement[]; children.push(...frameElements); }🧰 Tools
🪛 Biome (1.9.4)
[error] 2757-2758: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2774-2775: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts(39 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
server/src/workflow-management/selector.ts
[error] 54-54: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 141-141: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 307-307: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 394-394: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 605-605: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 692-692: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 799-799: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 886-886: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 1510-1511: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 1598-1598: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 1956-1957: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 2044-2044: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2330-2331: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 2418-2418: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2757-2758: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2774-2775: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 2893-2893: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
🔇 Additional comments (7)
server/src/workflow-management/selector.ts (7)
93-146: Good addition of frame handling support!This implementation properly handles both iframes and frames, enhancing the selector functionality for older websites that might use framesets. The depth limit and frame name detection are excellent additions.
🧰 Tools
🪛 Biome (1.9.4)
[error] 141-141: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
157-160: Keep commented code for link prioritization.While commented code is typically discouraged, this comment explicitly says "DO NOT REMOVE" and appears to be important context for future developers.
197-226: Consistent handling of frame vs iframe attributes.The code carefully distinguishes between iframe and frame elements, assigning appropriate properties (iframeURL vs frameURL) based on the element type. The addition of name attribute detection is particularly valuable for proper frame selection.
1611-1673: Good implementation of frame selector generation.The
genSelectorForFramefunction provides robust handling of frames with appropriate depth limits and fallback strategies. The differentiation between frame and iframe elements is well-implemented.
1857-1860: Good optimization for rel attribute selection.The special case handling for rel attributes is a good optimization, as it creates a more concise selector without having to run the full selector generation algorithm.
2861-2895: Good implementation of frame traversal for child selectors.The frame traversal code for
getChildSelectorsproperly handles the complexity of framesets and nested frames. The error handling is appropriate for cross-origin frames.🧰 Tools
🪛 Biome (1.9.4)
[error] 2893-2893: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
2086-2088:❓ Verification inconclusive
Fix potential off-by-one error in frame indexing.
When determining the frame's position for
:nth-of-typeselector, be cautious about 1-based indexing.
🏁 Script executed:
#!/bin/bash # Check how frames are indexed in the codebase echo "Checking for frame index usage patterns..." rg "nth-of-type" --type tsLength of output: 337
Attention: Verify frame indexing for nth-of-type usage
Please double-check that the value of
frameIndexis correctly aligned with CSS’s 1-based indexing for the:nth-of-typeselector. IfframeIndexoriginates from a 0-based system, increment it by one (i.e., useframeIndex + 1) to avoid off-by-one errors. Ensure that any adjustment here is consistent with the rest of the workflow management logic.
- Location:
server/src/workflow-management/selector.ts(lines 2086-2088)- Snippet to review:
return `${selector}:nth-of-type(${frameIndex})`;
| const isInFrameset = () => { | ||
| let node = element; | ||
| while (node && node.parentElement) { | ||
| if (node.tagName === 'FRAMESET' || node.tagName === 'FRAME') { | ||
| return true; | ||
| } | ||
| node = node.parentElement; | ||
| } | ||
| return false; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Refactor duplicated isInFrameset function.
The isInFrameset function is duplicated across multiple places in the code. This violates the DRY principle and creates maintenance challenges.
Consider extracting this into a utility function that accepts an element parameter:
// At the top of the file, add:
function isElementInFrameset(node: HTMLElement): boolean {
while (node?.parentElement) {
if (node.tagName === 'FRAMESET' || node.tagName === 'FRAME') {
return true;
}
node = node.parentElement;
}
return false;
}
// Then replace all instances with:
const isInFrameset = () => isElementInFrameset(element);Also applies to: 603-612, 797-806, 1508-1517, 1954-1963, 2328-2337
🧰 Tools
🪛 Biome (1.9.4)
[error] 307-307: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
Summary by CodeRabbit