-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] text extraction in Selector and SelectorList #127
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #127 +/- ##
==========================================
- Coverage 99.63% 98.19% -1.44%
==========================================
Files 5 5
Lines 271 277 +6
Branches 48 49 +1
==========================================
+ Hits 270 272 +2
- Misses 1 5 +4
Continue to review full report at Codecov.
|
parsel/selector.py
Outdated
""" | ||
if isinstance(cleaner, six.string_types): | ||
if cleaner not in {'html', 'text'}: | ||
raise ValueError("cleaner must be 'html', 'text' or " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one gotcha: this exception is raised in .get as well, but in .get there are two more accepted values: "auto" and None. Does it worth fixing?
parsel/selector.py
Outdated
if cleaner == 'html': | ||
cleaner = self._html_cleaner | ||
elif cleaner == 'text': | ||
cleaner = self._text_cleaner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative is make these attributes public, and ask users to pass them: sel.cleaned(sel.TEXT_CLEANER)
instead of sel.cleaned('text')
.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #127 +/- ##
==========================================
- Coverage 92.18% 90.88% -1.30%
==========================================
Files 5 5
Lines 448 472 +24
Branches 91 99 +8
==========================================
+ Hits 413 429 +16
- Misses 26 30 +4
- Partials 9 13 +4 ☔ View full report in Codecov by Sentry. |
# Conflicts: # parsel/selector.py # setup.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me so far.
Selector text tests
The assertion was wrong
# Conflicts: # parsel/selector.py # tests/test_selector.py
Co-authored-by: Adrián Chaves <[email protected]>
Selector text doc
# Conflicts: # parsel/selector.py # tests/test_selector.py # tests/typing/selector.py # tox.ini
still 👍 from far away 🚢 |
To extract all text of one or more element and all their child elements, | ||
formatted as plain text taking into account HTML tags (e.g. ``<br/>`` is | ||
translated as a line break), set ``text=True`` in your call to | ||
:meth:`~parsel.selector.Selector.get` or | ||
:meth:`~parsel.selector.Selector.getall` instead of including | ||
``::text`` (CSS) or ``/text()`` (XPath) in your query:: | ||
|
||
>>> selector.css('#images').get(text=True) | ||
'Name: My image 1\nName: My image 2\nName: My image 3\nName: My image 4\nName: My image 5' | ||
|
||
See :meth:`Selector.get` for additional parameters that you can use to change | ||
how the extracted plain text is formatted. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like for many use cases .get(text=True)
could provide more reasonable behavior than /text()
or ::text
in a selector. From this point of view, I wonder if we should make it one of the first examples, and review many other examples as well. But it seems we can also do it separately, not as a part of this PR, so I'm not working on it.
I've opened it for a discussion, it is not a finished solution (yet), but something one can install any try the API. See #34 for original proposal.
Here
options to extract text are added for Selector and SelectorList, using https://github.com/TeamHG-Memex/html-text library..text()
methodsProblems:
[RESOLVED] Naming issue. In scrapy it can be convenient to have
response.text()
shortcut, to use it instead ofresponse.css('body').text()
orresponse.selector.text()
. But we already have response.text, which is unicode body. It makes this feature more confusing - selector's .text() methods are very different from response.text attribute. From this point of view, .text_content() name sounds better. Any ideas for a shorter / alternative name? UPD: resolved by making text conversion .get argument[RESOLVED] There is a circular package dependency: html_text requires parsel, and parsel requires html_text. This is not a problem code-wise, but I haven't checked how well pip can handle it. In a basic case it seems to work, but I wonder if we get issues related to this. It can be solved by changing html_text API and making its parsel dependency optional. UPD: this is fixed in Remove parsel dependency TeamHG-Memex/html-text#15
[RESOLVED] parsel imports private html_text methods. This can be solved by changing html_text API. UPD: fixed at Remove parsel dependency TeamHG-Memex/html-text#15
[RESOLVED] Cleaning is called for each Selector.text() call. So e.g. in case of
sel.css('div').text()
each div will be cleaned and copied - instead of cleaning a tree once. I'm not sure how large is this problem tough; probably it is inefficient when you need to extract text from nested elements (e.g. from all elements) - it means cleaning will be run multiple times on same parts of the tree, makingsel.xpath("*").text()
O(N^2) instead of O(N). Alternative solution is to havesel.cleaned().text()
or something like this;.cleaned()
may allow lxml Cleaner arguments. But it looks like a separate feature; also, Cleaner parameters which work best with html-text are not default lxml's. UPD: there is .cleaned() method which supports different Cleaners, O(N^2) caveat is mentioned in the docs.[RESOLVED] When user requests sel.text() from an element which is removed by Cleaner (e.g.
sel.css('script')[0].text()
, None is returned. Should it be an empty string? UPD: we (me and @dangra) think None is fine.[RESOLVED] SelectorList.text() joins text. This is similar to what's proposed in Add method that allows joining the extracted result into a string scrapy#772, but different from SelectorList.get, which returns the first element. If needed, we can support both behaviors, by allowing sep=None, and probably using it by default (or join=None if we rename 'sep' argument to 'join'), meaning "don't join, take first" - or would it be too confusing? UPD: SelectorList no longer joins text; as there is text extraction support in .getall, it is easy to join text on user side.
[RESOLVED] Joining in SelectorList.text can be confusing if SelectorList selects nested elements. UPD: SelectorList no longer joins text.
TODO: