Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response #10

Merged
merged 30 commits into from
May 30, 2022

Conversation

BurnzZ
Copy link
Member

@BurnzZ BurnzZ commented Apr 26, 2022

Fixes #9


The approach of storing the Zyte API response inside the response.meta was avoided since it will break the expectation in Scrapy that response.meta comes directly from response.request.meta, as what this code snippet embodies:

@property
def meta(self):
    try:
        return self.request.meta
    except AttributeError:
        raise AttributeError(
            "Response.meta not available, this response "
            "is not tied to any request"
        )

In addition, this is what the official Scrapy docs says about response.meta:

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

Thus, this PR introduces these new classes which inherits from Scrapy's Response and TextResponse respectively:

  • ZyteAPITextResponse
  • ZyteAPIResponse

This enables users to retrieve the raw Zyte Data API response as a dict via response.zyte_api_response.

TODO:

  • more tests
  • update README
  • update CHANGELOG

@BurnzZ BurnzZ requested a review from kmike April 26, 2022 13:07
@codecov
Copy link

codecov bot commented Apr 26, 2022

Codecov Report

Merging #10 (f5a9bb0) into main (001e201) will increase coverage by 0.74%.
The diff coverage is 98.55%.

@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
+ Coverage   91.95%   92.70%   +0.74%     
==========================================
  Files           2        3       +1     
  Lines          87      137      +50     
==========================================
+ Hits           80      127      +47     
- Misses          7       10       +3     
Impacted Files Coverage Δ
scrapy_zyte_api/handler.py 88.63% <95.23%> (-3.23%) ⬇️
scrapy_zyte_api/responses.py 100.00% <100.00%> (ø)

README.rst Outdated Show resolved Hide resolved
scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
scrapy_zyte_api/responses.py Outdated Show resolved Hide resolved
scrapy_zyte_api/responses.py Outdated Show resolved Hide resolved
@BurnzZ BurnzZ marked this pull request as ready for review April 28, 2022 12:15
tests/test_responses.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Contributor

I was going to suggest changing the class names, because eventually we probably want to allow non-browser responses to be TextResponse subclasses (#15).

However, now I am thinking this naming is OK, and what needs to change to address #15 is only that we use ZyteAPITextResponse for non-browser responses as well when those are plain-text responses.

I imagine there is a case for users wanting to know if the response content comes from browser or non-browser data, but I don’t think having different classes so that this can be checked with isinstance is the right way to go. If anyone requests such a feature, maybe we could add one more property for that (e.g. response.source == "browserHtml").

@kmike
Copy link
Member

kmike commented May 25, 2022

hey! I think that here we should focus on exposing raw API response, as JSON. Auto-filling httpResponseHeaders also looks out of scope, it's a separate discussion.

@Gallaecio's questions:

ZyteAPI*Response directly maps browserHtml if available, else httpResponseBody, as proposed. In the future maybe we can allow users to override this behavior through a plugin-specific zyte_api meta subkey.

This sounds good to me. I'm not sure we would even need to override this behavior in future.

Whether ZyteAPIResponse or ZyteAPITextResponse does not depend on whether the body comes from browserHtml or httpResponseBody, but whether the response is a text response or not.

That's correct. Though browserHtml responses are always text responses.

ZyteAPI*Response.api (rename suggestion for .zyte_api_response) is an instance of ExtractResult as proposed in zytedata/python-zyte-api#12, which is a dict with the API response plus some helpers.

I'd keep it out of scope here. We can add more helpers later, and discuss it separately.

ZyteAPIResponse.http and ZyteAPIResponse.browser are regular Scrapy response objects (which class depends on their content and headers) mapping the corresponding content from Zyte Data API, or None when not received. They would be lazy-loaded, so that if you do not use .http it never gets even base64-decoded.

I think that's fine not to implement it now, and focus on returning the raw data.

ZyteAPI*Response.replace works on the top-level response data, but .http and .browser remain unchanged. (I have not thought too hard about this one, maybe there are scenarios where this is problematic)

Let's keep this out of scope.

What do you think about implementing the following behavior?

  1. For any Zyte API response user should be able to access the raw returned data (decoded from JSON) as .zyte_api_response attribute.
  2. If there is browserHtml in response, user should be able to run css/xpath selectors, and the response type should be a subclass of TextResponse.
  3. When only httpResponseBody is in the output (but not httpResponseHeaders), user can't run selectors, and response type is not a subclass of TextResponse.
  4. When both httpResponseBody and httpResponseHeaders are in response, there should be some kind of content type detection (probably relying on default Scrapy's). If data is text, response type should be a subclass of TextResponse, and user should be able to run css/xpath selectors. I guess (4) is what can be tricky to implement; if so, some alternative can be explored for (4).

@Gallaecio
Copy link
Contributor

Sounds good to me. I do think 3 could be eventually handled like 4, since content detection is possible without headers, only not ideal, but even then it can be handled separately later.

@BurnzZ
Copy link
Member Author

BurnzZ commented May 25, 2022

Sounds good to me as well. I've got an idea on how to implement point 4 but testing it with a lot of difference cases is indeed quite tricky. I'll have an update tomorrow after the test cases are sorted out, to see if we're okay with approach.

@BurnzZ
Copy link
Member Author

BurnzZ commented May 26, 2022

For any Zyte API response user should be able to access the raw returned data (decoded from JSON) as .zyte_api_response attribute.

Yes, this one is already implemented. Albeit, .zyte_api_response has been renamed to .zyte_api to prevent redundancies in naming like response.zyte_api_response.

If there is browserHtml in response, user should be able to run css/xpath selectors, and the response type should be a subclass of TextResponse.

This is implemented but I've added some tests in 910085b. Thanks for the reminder!

When only httpResponseBody is in the output (but not httpResponseHeaders), user can't run selectors, and response type is not a subclass of TextResponse.
When both httpResponseBody and httpResponseHeaders are in response, there should be some kind of content type detection (probably relying on default Scrapy's). If data is text, response type should be a subclass of TextResponse, and user should be able to run css/xpath selectors. I guess (4) is what can be tricky to implement; if so, some alternative can be explored for (4).

Implemented this on e3214d8

README.rst Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
scrapy_zyte_api/responses.py Outdated Show resolved Hide resolved
tests/test_responses.py Outdated Show resolved Hide resolved
README.rst Outdated
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
key to download a request using Zyte API. Full list of parameters is provided in the
`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
To enable every request to be sent through Zyte API, you can set the following
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.

That's actually the behavior I'm fine with :) It looks useful e.g. to set default geolocation for Zyte API requests made from a spider. On the other hand, making every request going through Zyte API by using this option looks quite problematic; it'd require more thought. For example,

  • would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?
  • would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?

It seems some feature to "enable Zyte API transparently" makes sense after we put a bit more effort into "transparent" integration of scrapy Requests with Zyte API direct downloader, it doesn't look as straightforward as setting default parameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.

That's right. #13 attempted to not change the behavior fully. It only addressed to prevent the user from repeating the same Zyte API parameters in different parts of the code. In any case, hopefully it should serve as a base point to enable all requests go through Zyte API. :)

would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?

That's a good point. This also extends to other things like downloading an image, file, etc.

I think one option is to prevent browserHtml from being set in the ZYTE_API_DEFAULT_PARAMS. The same is true for javascript, product, articles, etc. I think one way to think of it is having ZYTE_API_DEFAULT_PARAMS support only a handful of parameters, like geolocation or httpResponseBody (though it needs to toggled off when browserHtml is turned on), etc.

This could be a bit tricky since Zyte API offers a lot of features that doesn't only cater to being a simply proxy. I think for users to have a seamless experience when using Zyte API, we should have an interface that could cover general Scrapy use. This could mean only downloading the httpResponseBody and perhaps httpResponseHeaders by default which is available to all scrapy.Request unless overridden by the zyte_api param in meta.

We can open up another PR to explore the different options here.

would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?

I think this could be alleviated somehow by the proposed interface in #20.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @kmike’s main point was that the documentation needs an update, as it is currently quite misleading. The other points are about issues if we decided to make the implementation match the current documentation, which I don’t think we want, at least not at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for clarifying! What do you think about this doc update? 2adc8a6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation change looks great.

I do wonder if we should cause zyte_api: {} to enable Zyte Data API with default parameters. I think that is what I would expect to happen as a user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm a good point. There could be a few cases for this:

  1. The user is aware that ZYTE_API_DEFAULT_PARAMS is set in the settings and wish to use exactly the same values. Thus, zyte_api would be represented as an empty {}.
  2. ZYTE_API_DEFAULT_PARAMS is not in the settings. The user might've thought that it was set but was actually not. This would result in the requests to be executed without using Zyte API.

We could use meta={"zyte_api": {}} to represent it but it kinda feels unnatural as an empty-dict. Perhaps we could also support meta={"zyte_api": True} as another alternative. The {} is still a valid use-case though since the user might have a dynamic process to determine the contents of meta field. For example:

meta= {"zyte_api": {}}  # default
if does_it_look_like_we_need_javascript():
    meta["zyte_api"].update({"javascript": True})
if how_about_using_fr_region():
    meta["zyte_api"].update({"geolocation": "FR"})
yield scrapy.Request(url, self.callback_func, meta=meta)

To summarize, we could use {} and True to enable Zyte API Requests bearing in mind the need for ZYTE_API_DEFAULT_PARAMS be set.

Any thoughts on this?

Copy link
Contributor

@Gallaecio Gallaecio May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to having {} and True work and behave the same. At some point Zyte Data API may (should? 🙂) work without parameter other than the URL (e.g. consider browserHtml or httpResponseBody enabled if no other output is enabled), in which case zyte_api=True may become a common pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the only remaining fix to make, the PR looks good to me otherwise 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the code to accept True and {} in f5a9bb0.

CHANGES.rst Outdated Show resolved Hide resolved
scrapy_zyte_api/responses.py Outdated Show resolved Hide resolved
tests/test_responses.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@kmike kmike merged commit a188fcc into main May 30, 2022
@kmike
Copy link
Member

kmike commented May 30, 2022

🎉 thanks @BurnzZ! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow to access api_response
4 participants