Adding json + xpath headless extractors #6559

Merged
dogancanbakir merged 2 commits into dev from feat-6359-json-xpath-headless-extractor on Oct 29, 2025

Conversation

Mzack9999 (Member) commented on Oct 28, 2025

Proposed changes

Closes #6359

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

  • New Features

    • Added XPath support for extracting data elements from HTML content.
    • Added JSON support for extracting data using JSONPath-like query syntax.
  • Tests

    • Added extensive unit tests for data extraction and matching operations.
    • Includes tests for complex nested HTML and JSON structures.
    • Tests cover multiple extraction methods and data source types.

Mzack9999 added the Type: Enhancement label on Oct 28, 2025

coderabbitai bot (Contributor) commented on Oct 28, 2025

Walkthrough

Added support for XPathExtractor and JSONExtractor types in the headless protocol's Extract method by routing them to ExtractXPath() and ExtractJSON() functions respectively. Comprehensive test coverage validates extraction, matching, and handling of complex nested structures.
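The routing described in the walkthrough can be pictured with a minimal, self-contained sketch. This is not the code from pkg/protocols/headless/operators.go: the `Extractor` struct, the `ExtractorType` constants, the `getMatchPart` helper, and the stubbed `ExtractXPath`/`ExtractJSON` bodies below are simplified, hypothetical stand-ins that only borrow the names mentioned in this PR. The stubs tag their input so the dispatch is visible; they do not evaluate XPath or JSON queries.

```go
package main

import "fmt"

// ExtractorType is a hypothetical stand-in for the extractor type enum.
type ExtractorType int

const (
	RegexExtractor ExtractorType = iota
	XPathExtractor
	JSONExtractor
)

// Extractor is a simplified model used only for this sketch.
type Extractor struct {
	Type ExtractorType
	Part string // which part of the event data to read; empty means "data"
}

// ExtractXPath is a stub: it tags the corpus instead of evaluating XPath.
func (e *Extractor) ExtractXPath(corpus string) map[string]struct{} {
	return map[string]struct{}{"xpath:" + corpus: {}}
}

// ExtractJSON is a stub: it tags the corpus instead of evaluating a query.
func (e *Extractor) ExtractJSON(corpus string) map[string]struct{} {
	return map[string]struct{}{"json:" + corpus: {}}
}

// getMatchPart returns the named part of the event data as a string,
// falling back to the "data" part when no name is given (an assumption).
func getMatchPart(part string, data map[string]interface{}) (string, bool) {
	if part == "" {
		part = "data"
	}
	item, ok := data[part]
	if !ok {
		return "", false
	}
	return fmt.Sprint(item), true
}

// Extract routes the selected part to the method matching the extractor type.
func Extract(data map[string]interface{}, e *Extractor) map[string]struct{} {
	itemStr, ok := getMatchPart(e.Part, data)
	if !ok {
		return nil
	}
	switch e.Type {
	case XPathExtractor:
		return e.ExtractXPath(itemStr)
	case JSONExtractor:
		return e.ExtractJSON(itemStr)
	}
	return nil // types not handled in this sketch yield no results
}

func main() {
	data := map[string]interface{}{"data": "<html></html>"}
	fmt.Println(Extract(data, &Extractor{Type: XPathExtractor}))
}
```

The point of the sketch is that the two new cases add no logic of their own: they hand the already-selected part to extraction methods that exist elsewhere, which is why the review effort is rated as simple.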

Changes

| Cohort / File(s) | Summary |
|---|---|
| Headless Operators Implementation: pkg/protocols/headless/operators.go | Added two new cases in the Extract method's switch statement: XPathExtractor routing to ExtractXPath(itemStr) and JSONExtractor routing to ExtractJSON(itemStr), enabling XPath and JSON extraction capabilities in headless mode. |
| Headless Operators Tests: pkg/protocols/headless/operators_test.go | Added comprehensive test suite covering HTML extraction via XPath (text content, attributes, multiple items, non-existent paths), JSON extraction via JSONPath (IDs, names, nested values, emails, invalid JSON), XPath/JSON-based matching across data parts (default, named parts), and complex nested structures (JSON APIs, HTML commerce scenarios). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • The operator implementation change is minimal—two new switch cases routing to pre-existing extraction functions with no novel logic.
  • Test coverage is extensive but follows consistent, repetitive patterns across similar scenarios.
  • Primary review focus: verifying test scenarios comprehensively exercise both extraction types and match the feature requirements.

Poem

🐰 Headless hops with XPath and JSON so fine,
No HTML left unturned, no data out of line,
Extractors now aligned across all the modes,
Through nested paths and complex loads,
The rabbit's toolkit grows—what joy it bodes! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Out of Scope Changes Check | ⚠️ Warning | The PR includes implementation of JSONExtractor support alongside XPathExtractor, but issue #6359 specifically requests only XPath extractor support for headless mode. While JSONExtractor is closely related to the main feature and serves a complementary purpose within the same operators file, it represents an out-of-scope addition since the linked issue makes no mention of JSON extraction requirements. The JSON extractor feature, though adjacent and useful, extends beyond the explicitly stated requirements for the PR. | Consider either limiting the PR scope to XPath extractor support only (as requested in issue #6359) and deferring JSON extractor implementation to a separate feature request, or alternatively, create a corresponding issue for the JSON extractor enhancement to bring it formally into scope. This would ensure clear alignment between the PR changes and linked issues. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Adding json + xpath headless extractors" is clear, concise, and accurately reflects the main changes in the pull request. It specifically identifies the two new extractor types being added (JSON and XPath) and their context (headless mode), which directly aligns with the PR's objective to enhance headless protocol extraction capabilities. The title is descriptive enough for developers scanning history to understand the primary change. |
| Linked Issues Check | ✅ Passed | The PR successfully addresses the primary objective from linked issue #6359, which requests enabling the XPath extractor in headless mode. The code changes in operators.go add support for both XPathExtractor and JSONExtractor, with the former directly satisfying the requirement. Comprehensive unit tests in operators_test.go demonstrate that XPath extraction and matching work correctly in headless mode across various scenarios (text content, attributes, nested structures). The main requirement has been met, allowing users to employ XPath extraction in headless mode the same way as in HTTP mode. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82144e5 and e535e01.

📒 Files selected for processing (2)
  • pkg/protocols/headless/operators.go (1 hunks)
  • pkg/protocols/headless/operators_test.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Format Go code using go fmt
Run static analysis with go vet

Files:

  • pkg/protocols/headless/operators.go
  • pkg/protocols/headless/operators_test.go
pkg/protocols/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Each protocol implementation must provide a Request interface with methods Compile(), ExecuteWithResults(), Match(), and Extract()

Files:

  • pkg/protocols/headless/operators.go
  • pkg/protocols/headless/operators_test.go
🧬 Code graph analysis (2)
pkg/protocols/headless/operators.go (1)
pkg/operators/extractors/extractor_types.go (2)
  • XPathExtractor (21-21)
  • JSONExtractor (23-23)
pkg/protocols/headless/operators_test.go (4)
pkg/operators/extractors/extractors.go (1)
  • Extractor (11-116)
pkg/operators/extractors/extractor_types.go (4)
  • ExtractorTypeHolder (71-73)
  • ExtractorType (12-12)
  • XPathExtractor (21-21)
  • JSONExtractor (23-23)
pkg/operators/matchers/matchers.go (1)
  • Matcher (10-138)
pkg/operators/matchers/matchers_types.go (3)
  • MatcherTypeHolder (77-79)
  • MatcherType (12-12)
  • XPathMatcher (29-29)
🔇 Additional comments (2)
pkg/protocols/headless/operators.go (1)

79-82: LGTM! Clean implementation of XPath and JSON extractors.

The implementation correctly adds support for XPathExtractor and JSONExtractor types, following the established pattern of existing extractors. Both cases appropriately route to their respective extraction methods and maintain consistency with the codebase.

pkg/protocols/headless/operators_test.go (1)

1-566: Excellent test coverage for XPath and JSON extractors!

The test suite is comprehensive and well-structured, covering:

  • Basic extraction scenarios for both XPath and JSON
  • Attribute extraction and multiple item handling
  • Edge cases (invalid JSON, non-existent paths)
  • Different response parts (data, header, history)
  • Complex nested structures (API responses, e-commerce HTML)
  • XPath matching functionality

All tests follow Go testing conventions and provide thorough validation of the new extractor functionality in headless mode.
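To make the edge cases listed above concrete (invalid JSON, non-existent paths), here is a small stand-alone sketch. It is an illustration under stated assumptions, not the project's extractor: `extractJSONPath` is a hypothetical helper that walks a dot-separated key path using only the standard library, and it does not reproduce the query syntax the real JSON extractor accepts. The behaviour it demonstrates is the one the tests are described as checking: bad input or a missing path yields an empty result set rather than an error.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractJSONPath is a hypothetical helper for this sketch. It walks a
// dot-separated key path through decoded JSON objects and returns the value
// found, formatted as a string, in a set. It returns an empty set when the
// input is not valid JSON or the path does not exist.
func extractJSONPath(corpus, path string) map[string]struct{} {
	results := make(map[string]struct{})

	var doc interface{}
	if err := json.Unmarshal([]byte(corpus), &doc); err != nil {
		return results // invalid JSON: nothing extracted
	}

	current := doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := current.(map[string]interface{})
		if !ok {
			return results // path descends into a non-object value
		}
		current, ok = obj[key]
		if !ok {
			return results // key not present at this level
		}
	}

	results[fmt.Sprint(current)] = struct{}{}
	return results
}

func main() {
	body := `{"user": {"id": 7, "email": "a@example.com"}}`
	for _, path := range []string{"user.id", "user.email", "user.missing"} {
		fmt.Println(path, len(extractJSONPath(body, path)))
	}
	fmt.Println("invalid", len(extractJSONPath("{not json", "user.id")))
}
```

Returning an empty set instead of an error keeps the caller's handling uniform: a template that finds nothing and a template fed malformed input both produce no extracted values.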


Mzack9999 marked this pull request as ready for review on October 28, 2025 19:50
auto-assign bot requested a review from dogancanbakir on October 28, 2025 19:50
dogancanbakir merged commit 3be27b9 into dev on Oct 29, 2025 (20 checks passed)
dogancanbakir deleted the feat-6359-json-xpath-headless-extractor branch on October 29, 2025 11:56
Labels

Type: Enhancement

Development

Successfully merging this pull request may close these issues.

[FEATURE] Enable xpath extractor with headless mode

2 participants