Add rel-id xpath ext #100

Open
wants to merge 3 commits into
base: master

immerrr
Contributor

@immerrr immerrr commented Sep 6, 2017

This PR adds an xpathfunc that performs relative id lookups.

There are currently two ways to look up an element by id:

  • sel.xpath('id("foo")') performs a dictionary lookup under the hood and is therefore very fast, but there is no way to limit the nodeset it searches.
  • sel.xpath('//*[@id="foo"]') lets you limit the nodeset however you like, but it has to traverse every candidate node and is therefore much slower.
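The trade-off between the two lookups can be sketched with the standard library alone. This is an illustration using xml.etree.ElementTree, not parsel or lxml, and the names `id_index`, `lookup_by_index` and `lookup_by_scan` are hypothetical:

```python
import xml.etree.ElementTree as ET

DOC = """<html><body>
<div id="shell"><p id="masthead">hi</p></div>
<span id="other">x</span>
</body></html>"""

root = ET.fromstring(DOC)

# Strategy 1: build an id -> element index once; each lookup is a dict access,
# but the index always covers the whole document.
id_index = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def lookup_by_index(ident):
    return id_index.get(ident)


# Strategy 2: scan a chosen subtree; the search is scoped, but every node
# in that subtree has to be visited.
def lookup_by_scan(scope, ident):
    return [el for el in scope.iter() if el.get("id") == ident]


shell = lookup_by_index("shell")
print(lookup_by_scan(shell, "masthead"))  # found inside the "shell" subtree
print(lookup_by_scan(shell, "other"))     # empty: "other" lives elsewhere
```

The index answers quickly but cannot be restricted to a subtree, while the scan can be restricted but pays for every node it visits.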

The rel-id function presented in this PR aims for a middle ground: it performs the id lookup under the hood, then checks that the result lies within the specified nodeset. As a result, all of the following statements return the same results:

sel.xpath('rel-id("foo", //div)')
sel.xpath('//div').xpath('rel-id("foo")')
sel.xpath('id("foo")[ancestor::div]')
sel.xpath('id("foo")[set:intersection(ancestor::*, //div)]')
sel.xpath('//div//*[@id="foo"]')
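The "look up first, then check the nodeset" idea can be sketched as follows. This is a standard-library approximation under stated assumptions, not the PR's implementation: `build_indexes` and `rel_id` are hypothetical names, and ElementTree stands in for the lxml tree that parsel uses.

```python
import xml.etree.ElementTree as ET


def build_indexes(root):
    """Return (id -> element, child -> parent) maps for an ElementTree root."""
    ids = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    parents = {child: parent for parent in root.iter() for child in parent}
    return ids, parents


def rel_id(ident, nodeset, ids, parents):
    """Hypothetical rel-id: dict lookup first, then an ancestor check."""
    hit = ids.get(ident)
    if hit is None:
        return []
    allowed = set(nodeset)
    node = parents.get(hit)
    while node is not None:  # walk up from the hit towards the root
        if node in allowed:
            return [hit]
        node = parents.get(node)
    return []


root = ET.fromstring(
    '<html><div><p id="foo"/></div><span><b id="bar"/></span></html>'
)
ids, parents = build_indexes(root)
divs = root.findall(".//div")
print(rel_id("foo", divs, ids, parents))  # the <p> element inside a div
print(rel_id("bar", divs, ids, parents))  # empty: "bar" is not under any div
```

The cost is one dictionary access plus a walk up the ancestor chain, rather than a traversal of every node under every div.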

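Naturally, rel-id is a Python-level xpathfunc, so "native" solutions built on id and ancestor are faster, but it still outperforms [@id="foo"] and .css("div #foo") (which expands to [@id="foo"]). In each benchmark group below, the first column is the measured time and the second is the ratio to the group's first row: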

sel.css("#masthead")                                                    0.971  1.000
sel.xpath("//*[@id='masthead']")                                        1.186  1.221
sel.xpath("id('masthead')")                                             0.032  0.033
sel.xpath("rel-id('masthead')")                                         0.051  0.053


sel.css("#shell #masthead")                                             2.162  1.000
sel.xpath("//*[@id='shell']//*[@id='masthead']")                        2.257  1.044
sel.xpath("id('shell')//*[@id='masthead']")                             1.147  0.531
sel.xpath("id('masthead')[ancestor::*[@id='shell']]")                   0.039  0.018
sel.xpath("id('masthead')[set:intersection(ancestor::*, id('shell'))]")  0.037  0.017
sel.xpath("rel-id('masthead', id('shell'))")                            0.055  0.025
sel.xpath("id('shell')").xpath("rel-id('masthead')")                    0.090  0.041


sel.css("div #masthead")                                               12.127  1.000
sel.xpath("id('masthead')[ancestor::div]")                              0.035  0.003
sel.xpath("rel-id('masthead', //div)")                                  0.558  0.046
sel.xpath("//div").xpath("rel-id('masthead')")                         17.939  1.479


sel.xpath("id('masthead')[set:intersection(ancestor::*, (//div|//span))]")  0.248  1.000
sel.xpath("rel-id('masthead', (//div|//span))")                         0.670  2.700
sel.xpath("//div|//span").xpath("rel-id('masthead')")                  19.825 79.845

The benchmark is available here. Note that sel.xpath("//div").xpath("rel-id('masthead')") and sel.xpath("//div|//span").xpath("rel-id('masthead')") are very slow because rel-id is invoked once for every node in the selection.
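The per-node cost of the chained form can be illustrated by counting invocations. This is a rough standard-library sketch, assuming a hypothetical document of 1000 divs and a hypothetical `rel_id` stand-in. It counts calls rather than measuring time, and it is not the benchmark referenced above:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 1000 sibling divs, only the last one holds the target.
root = ET.Element("html")
for _ in range(1000):
    div = ET.SubElement(root, "div")
ET.SubElement(div, "p", {"id": "masthead"})

ids = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
parents = {child: parent for parent in root.iter() for child in parent}
stats = {"calls": 0}


def rel_id(ident, nodeset):
    """Hypothetical stand-in for rel-id that counts how often it is invoked."""
    stats["calls"] += 1
    allowed = set(nodeset)
    hit = ids.get(ident)
    node = parents.get(hit)
    while node is not None:
        if node in allowed:
            return [hit]
        node = parents.get(node)
    return []


divs = root.findall("div")

# Like rel-id('masthead', //div): one invocation, nodeset passed as an argument.
stats["calls"] = 0
one_shot = rel_id("masthead", divs)
one_shot_calls = stats["calls"]

# Like sel.xpath("//div").xpath("rel-id('masthead')"): one invocation per div.
stats["calls"] = 0
chained = [hit for d in divs for hit in rel_id("masthead", [d])]
chained_calls = stats["calls"]

print(one_shot_calls, chained_calls)
```

Both forms return the same element, but the chained form pays the function-call overhead once per selected node.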

rel-id is particularly helpful when you pre-select a subset of the document and then search its descendants:

sel2 = sel.xpath('id("foo")')
sel2.xpath('rel-id("bar")')

The sel2.xpath('id("bar")[set:intersection(ancestor::*, .)]') approach won't work here: inside the square brackets, the dot already refers to the id("bar") node rather than to id("foo").

@codecov

codecov bot commented Sep 6, 2017

Codecov Report

Merging #100 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master   #100   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           5      5           
  Lines         248    265   +17     
  Branches       46     51    +5     
=====================================
+ Hits          248    265   +17

Impacted Files          Coverage Δ
parsel/xpathfuncs.py    100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 68d64db...eb7edc4.

@immerrr
Contributor Author

immerrr commented Sep 6, 2017

Another, slightly subtle, application follows from the nodeset defaulting to the current context node:

sel.xpath('//div[rel-id("some-id")]')

which means: select every div that contains an element with id="some-id".
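Read this way, the predicate form returns the div ancestors of the element with that id. A minimal standard-library sketch of that reading, where `divs_containing` is a hypothetical helper and ElementTree stands in for parsel's tree:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<html><div class="outer"><div class="inner"><p id="some-id"/></div></div>'
    '<div class="unrelated"/></html>'
)
ids = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
parents = {child: parent for parent in root.iter() for child in parent}


def divs_containing(ident):
    """Hypothetical equivalent of //div[rel-id(ident)]: div ancestors of the hit."""
    found = []
    node = parents.get(ids.get(ident))
    while node is not None:
        if node.tag == "div":
            found.append(node)
        node = parents.get(node)
    found.reverse()  # document order: outermost div first
    return found


print([d.get("class") for d in divs_containing("some-id")])
# prints ['outer', 'inner']
```

The unrelated div is skipped without being visited, because the walk starts at the id lookup result and only moves upwards.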

@Gallaecio
Member

If we decide to merge this, we should probably update the documentation first.
