Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Propose a minimal specialization for extract #12
base: main
Are you sure you want to change the base?
[RFC] Propose a minimal specialization for extract #12
Changes from all commits
af9d34d
9270f50
c869675
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there could be a big overlap with data structures defined in https://github.com/scrapinghub/web-poet/blob/master/web_poet/page_inputs/http.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I'm thinking if
python-zyte-api
should somehow importweb-poet
to utilize such page_inputs or should the page_inputs themselves be moved outside ofweb-poet
.Later on, we'd need the functionalities of
web_poet.page_inputs.http
to process content-encoding when dealing withbrowserHtml
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll see you and raise you: scrapy-plugins/scrapy-zyte-api#10 may be in a similar situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want
ExtractResult
here to be a subclass of web_poet’s HttpResponse, or expose a compatible interface? Or is the problem only not to reinvent the wheel code-wise? Or do we wantExtractResult
to exposehttpResponseBody
,httpResponseHeaders
andbrowserHtml
through 2 HttpResponse objects, or possibly a different, newweb_poet
object in the case ofbrowserHtml
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's both. We'd want the ecosystem revolving around Zyte's extraction and crawling be similar in interface (i.e.
web-poet
,scrapy-poet
,python-zyte-api
,scrapy-zyte-api
, etc).I'm trying to think if there are some downsides in using
web-poet
for the Zyte API's client, but I can't think of any. I think it's more beneficial sinceweb-poet
would be used behind Zyte API. This means both server and client would use the same dependency promoting overall compatibility.After trying to make
httpResponseBody
work with Text Responses in scrapy-plugins/scrapy-zyte-api#10, I realized that we'll need to read the headers easily.web_poet.httpResponseHeaders
could easily accommodate Zyte API's formatting. For example, if we'd want to determine if a given response is text-based or not, we can check out theContent-Type
header. However, Zyte API returns the headers like this:To search for this value, we'd need to iterate through the list of header key-value pairs. On the other hand,
web-poet
would easily have this as:There's a lot of benefits to using the existing features in
web-poet
similar to this one.For the browserHtml, I think it would be worth representing it with another class, since
web_poet.HttpResponseBody
representsbytes
.