Support Zorba as an alternative XML/HTML processing engine #29

gerosalesc · 2016-02-18T15:42:34Z

This has been on my mind for some time: I would like this project to support a more powerful XML/HTML processing engine as an alternative to lxml. As far as I can tell, the only contender to lxml in Python is Zorba. But why?

Zorba supports XQuery technology as well as JSONiq.
Zorba has Python bindings. I know they are not precisely the best bindings ever but at least they exist.
I think XPath 1.0 is too limited for more complex structures.
The lxml extensions are OK, but they fall short of what XQuery offers out of the box.
Zorba can be hosted as a service.

Ideally, we should be able to use selectors with Zorba in this way:

Selector(response=response).xquery('...').extract()
or
response.selector.xquery('...').extract()
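To make the proposed call chain concrete, here is a minimal sketch of how such an interface could be shaped. Everything in it is hypothetical: `XQuerySelector`, `ResultList`, and the engine signature are made-up names, the selector takes raw text rather than a response object, and `echo_engine` is a stand-in that does not evaluate XQuery at all. A real version would call into Zorba (or another XQuery processor) where the stand-in is.

```python
from typing import Callable, List

# Hypothetical engine signature: (document text, query) -> list of string results.
# A real implementation would call into an XQuery processor here.
XQueryEngine = Callable[[str, str], List[str]]


class ResultList(list):
    """List of string results with an extract() method, mirroring the chain above."""

    def extract(self) -> List[str]:
        return list(self)


class XQuerySelector:
    """Hypothetical selector that hands XQuery evaluation to a pluggable engine."""

    def __init__(self, text: str, engine: XQueryEngine):
        self.text = text
        self._engine = engine

    def xquery(self, query: str) -> ResultList:
        # Delegate evaluation to whichever engine was plugged in.
        return ResultList(self._engine(self.text, query))


def echo_engine(text: str, query: str) -> List[str]:
    """Stand-in engine for demonstration only: it does NOT evaluate XQuery."""
    return [f"{query} over {len(text)} chars"]


if __name__ == "__main__":
    selector = XQuerySelector("<a/>", echo_engine)
    print(selector.xquery("//a").extract())
```

Keeping the engine behind a plain callable would let lxml stay the default while an XQuery backend remains an optional extra.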

eliasdorneles · 2016-04-26T19:24:15Z

Hello @gerosalesc !

So, to be fair I don't see lxml going away anytime soon, but this looks like a nice optional addition.

I'm not really familiar with Zorba nor its bindings, but this seems worth a proof-of-concept.
Could you please point me to some use cases where supporting XQuery would give the biggest benefits?

Thank you!

gerosalesc · 2016-05-06T16:48:21Z

@eliasdorneles Hi there buddy. I have found myself needing some XQuery features when doing serious work to extract values from highly complex HTML pages.

Take the FLWOR syntax, for example. That alone would allow us to sort the values of a list of elements. You can also get more complex structures returned, and perform some interesting data comparisons and transformations with the XPath 2.0 functions, which XQuery supports by default.
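As a rough illustration of the sorting case, here is what has to be done by hand today: select the elements with an XPath 1.0 style path, then sort in Python. This is only a sketch. The sample markup and the `names_by_price` helper are made up, and it uses the standard library's `xml.etree.ElementTree` instead of lxml just to stay dependency-free. The comment shows roughly the FLWOR expression it emulates.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup for illustration.
DOC = """
<ul>
  <li><span class="name">pear</span><span class="price">3.50</span></li>
  <li><span class="name">apple</span><span class="price">1.20</span></li>
  <li><span class="name">kiwi</span><span class="price">2.00</span></li>
</ul>
"""

# Roughly the XQuery being emulated:
#   for $li in //li
#   order by xs:decimal($li/span[@class="price"])
#   return $li/span[@class="name"]/text()


def names_by_price(markup):
    """Return item names ordered by ascending price."""
    root = ET.fromstring(markup.strip())
    items = root.findall(".//li")  # selection step (path expression)

    def price(li):
        return float(li.find("span[@class='price']").text)

    # The sort has to happen outside the query language.
    return [li.find("span[@class='name']").text for li in sorted(items, key=price)]


if __name__ == "__main__":
    print(names_by_price(DOC))
```

With XQuery the ordering would live inside the query itself, so a selector call could return the already sorted names.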

I understand that we are highly coupled to lxml, but I think this change would open up a whole world of new possibilities for this library.

For a PoC I see myself using the XQilla bindings because it seems to be easier. BTW you guys should consider XQilla as well as Zorba.

Gallaecio · 2019-09-24T17:34:44Z

https://github.com/28msec/zorba seems dead, should we close this?

gerosalesc mentioned this issue Feb 18, 2016

Alternatives to Lxml as XML processing engine scrapy/scrapy#1784

Closed

eliasdorneles mentioned this issue Oct 4, 2017

Add format_as to extract() methods #101

Closed

aschey mentioned this issue Jun 30, 2018

Allow scrapers to use better html parsing methods than Scrapy's basic css and xpath 1.0 selectors In2ItChicago/In2ItChicago#24

Closed

Gallaecio added the enhancement label May 9, 2019

Gallaecio changed the title ~~Alternatives to Lxml as XML processing engine~~ Support Zorba as an alternative XML/HTML processing engine May 9, 2019

Gallaecio added the needs more info label Sep 24, 2019

Gallaecio closed this as completed Oct 29, 2019

barrio mentioned this issue Apr 30, 2024

Parsel import causes crash #294

Closed
