introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response #10

BurnzZ · 2022-04-26T13:07:58Z

Fixes #9

The approach of storing the Zyte API response inside the response.meta was avoided since it will break the expectation in Scrapy that response.meta comes directly from response.request.meta, as what this code snippet embodies:

@property
def meta(self):
    try:
        return self.request.meta
    except AttributeError:
        raise AttributeError(
            "Response.meta not available, this response "
            "is not tied to any request"
        )

In addition, this is what the official Scrapy docs says about response.meta:

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

Thus, this PR introduces these new classes which inherits from Scrapy's Response and TextResponse respectively:

ZyteAPITextResponse
ZyteAPIResponse

This enables users to retrieve the raw Zyte Data API response as a dict via response.zyte_api_response.

TODO:

more tests
update README
update CHANGELOG

codecov · 2022-04-26T13:09:39Z

Codecov Report

Merging #10 (f5a9bb0) into main (001e201) will increase coverage by 0.74%.
The diff coverage is 98.55%.

@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
+ Coverage   91.95%   92.70%   +0.74%     
==========================================
  Files           2        3       +1     
  Lines          87      137      +50     
==========================================
+ Hits           80      127      +47     
- Misses          7       10       +3

Impacted Files	Coverage Δ
scrapy_zyte_api/handler.py	`88.63% <95.23%> (-3.23%)`	⬇️
scrapy_zyte_api/responses.py	`100.00% <100.00%> (ø)`

README.rst

scrapy_zyte_api/handler.py

scrapy_zyte_api/responses.py

…g lost

tests/test_responses.py

remove 'Content-Encoding' header when returning responses

Gallaecio · 2022-05-13T07:09:25Z

I was going to suggest changing the class names, because eventually we probably want to allow non-browser responses to be TextResponse subclasses (#15).

However, now I am thinking this naming is OK, and what needs to change to address #15 is only that we use ZyteAPITextResponse for non-browser responses as well when those are plain-text responses.

I imagine there is a case for users wanting to know if the response content comes from browser or non-browser data, but I don’t think having different classes so that this can be checked with isinstance is the right way to go. If anyone requests such a feature, maybe we could add one more property for that (e.g. response.source == "browserHtml").

kmike · 2022-05-25T13:39:09Z

hey! I think that here we should focus on exposing raw API response, as JSON. Auto-filling httpResponseHeaders also looks out of scope, it's a separate discussion.

@Gallaecio's questions:

ZyteAPI*Response directly maps browserHtml if available, else httpResponseBody, as proposed. In the future maybe we can allow users to override this behavior through a plugin-specific zyte_api meta subkey.

This sounds good to me. I'm not sure we would even need to override this behavior in future.

Whether ZyteAPIResponse or ZyteAPITextResponse does not depend on whether the body comes from browserHtml or httpResponseBody, but whether the response is a text response or not.

That's correct. Though browserHtml responses are always text responses.

ZyteAPI*Response.api (rename suggestion for .zyte_api_response) is an instance of ExtractResult as proposed in zytedata/python-zyte-api#12, which is a dict with the API response plus some helpers.

I'd keep it out of scope here. We can add more helpers later, and discuss it separately.

ZyteAPIResponse.http and ZyteAPIResponse.browser are regular Scrapy response objects (which class depends on their content and headers) mapping the corresponding content from Zyte Data API, or None when not received. They would be lazy-loaded, so that if you do not use .http it never gets even base64-decoded.

I think that's fine not to implement it now, and focus on returning the raw data.

ZyteAPI*Response.replace works on the top-level response data, but .http and .browser remain unchanged. (I have not thought too hard about this one, maybe there are scenarios where this is problematic)

Let's keep this out of scope.

What do you think about implementing the following behavior?

For any Zyte API response user should be able to access the raw returned data (decoded from JSON) as .zyte_api_response attribute.
If there is browserHtml in response, user should be able to run css/xpath selectors, and the response type should be a subclass of TextResponse.
When only httpResponseBody is in the output (but not httpResponseHeaders), user can't run selectors, and response type is not a subclass of TextResponse.
When both httpResponseBody and httpResponseHeaders are in response, there should be some kind of content type detection (probably relying on default Scrapy's). If data is text, response type should be a subclass of TextResponse, and user should be able to run css/xpath selectors. I guess (4) is what can be tricky to implement; if so, some alternative can be explored for (4).

Gallaecio · 2022-05-25T15:19:26Z

Sounds good to me. I do think 3 could be eventually handled like 4, since content detection is possible without headers, only not ideal, but even then it can be handled separately later.

BurnzZ · 2022-05-25T15:23:12Z

Sounds good to me as well. I've got an idea on how to implement point 4 but testing it with a lot of difference cases is indeed quite tricky. I'll have an update tomorrow after the test cases are sorted out, to see if we're okay with approach.

BurnzZ · 2022-05-26T07:50:31Z

For any Zyte API response user should be able to access the raw returned data (decoded from JSON) as .zyte_api_response attribute.

Yes, this one is already implemented. Albeit, .zyte_api_response has been renamed to .zyte_api to prevent redundancies in naming like response.zyte_api_response.

If there is browserHtml in response, user should be able to run css/xpath selectors, and the response type should be a subclass of TextResponse.

This is implemented but I've added some tests in 910085b. Thanks for the reminder!

When only httpResponseBody is in the output (but not httpResponseHeaders), user can't run selectors, and response type is not a subclass of TextResponse.
When both httpResponseBody and httpResponseHeaders are in response, there should be some kind of content type detection (probably relying on default Scrapy's). If data is text, response type should be a subclass of TextResponse, and user should be able to run css/xpath selectors. I guess (4) is what can be tricky to implement; if so, some alternative can be explored for (4).

Implemented this on e3214d8

README.rst

CHANGES.rst

README.rst

scrapy_zyte_api/responses.py

tests/test_responses.py

kmike · 2022-05-26T15:30:14Z

README.rst

-<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
-key to download a request using Zyte API. Full list of parameters is provided in the
-`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
+To enable every request to be sent through Zyte API, you can set the following


I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.

That's actually the behavior I'm fine with :) It looks useful e.g. to set default geolocation for Zyte API requests made from a spider. On the other hand, making every request going through Zyte API by using this option looks quite problematic; it'd require more thought. For example,

would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?

would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?

It seems some feature to "enable Zyte API transparently" makes sense after we put a bit more effort into "transparent" integration of scrapy Requests with Zyte API direct downloader, it doesn't look as straightforward as setting default parameters.

I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.

That's right. #13 attempted to not change the behavior fully. It only addressed to prevent the user from repeating the same Zyte API parameters in different parts of the code. In any case, hopefully it should serve as a base point to enable all requests go through Zyte API. :)

would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?

That's a good point. This also extends to other things like downloading an image, file, etc.

I think one option is to prevent browserHtml from being set in the ZYTE_API_DEFAULT_PARAMS. The same is true for javascript, product, articles, etc. I think one way to think of it is having ZYTE_API_DEFAULT_PARAMS support only a handful of parameters, like geolocation or httpResponseBody (though it needs to toggled off when browserHtml is turned on), etc.

This could be a bit tricky since Zyte API offers a lot of features that doesn't only cater to being a simply proxy. I think for users to have a seamless experience when using Zyte API, we should have an interface that could cover general Scrapy use. This could mean only downloading the httpResponseBody and perhaps httpResponseHeaders by default which is available to all scrapy.Request unless overridden by the zyte_api param in meta.

We can open up another PR to explore the different options here.

would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?

I think this could be alleviated somehow by the proposed interface in #20.

I think @kmike’s main point was that the documentation needs an update, as it is currently quite misleading. The other points are about issues if we decided to make the implementation match the current documentation, which I don’t think we want, at least not at this point.

I see, thanks for clarifying! What do you think about this doc update? 2adc8a6

The documentation change looks great.

I do wonder if we should cause zyte_api: {} to enable Zyte Data API with default parameters. I think that is what I would expect to happen as a user.

Hmm a good point. There could be a few cases for this:

The user is aware that ZYTE_API_DEFAULT_PARAMS is set in the settings and wish to use exactly the same values. Thus, zyte_api would be represented as an empty {}.

ZYTE_API_DEFAULT_PARAMS is not in the settings. The user might've thought that it was set but was actually not. This would result in the requests to be executed without using Zyte API.

We could use meta={"zyte_api": {}} to represent it but it kinda feels unnatural as an empty-dict. Perhaps we could also support meta={"zyte_api": True} as another alternative. The {} is still a valid use-case though since the user might have a dynamic process to determine the contents of meta field. For example:

meta= {"zyte_api": {}} # default if does_it_look_like_we_need_javascript(): meta["zyte_api"].update({"javascript": True}) if how_about_using_fr_region(): meta["zyte_api"].update({"geolocation": "FR"}) yield scrapy.Request(url, self.callback_func, meta=meta)

To summarize, we could use {} and True to enable Zyte API Requests bearing in mind the need for ZYTE_API_DEFAULT_PARAMS be set.

Any thoughts on this?

+1 to having {} and True work and behave the same. At some point Zyte Data API may (should? 🙂) work without parameter other than the URL (e.g. consider browserHtml or httpResponseBody enabled if no other output is enabled), in which case zyte_api=True may become a common pattern.

This seems to be the only remaining fix to make, the PR looks good to me otherwise 👍

Refactored the code to accept True and {} in f5a9bb0.

Co-authored-by: Adrián Chaves <[email protected]>

tests/test_responses.py

CHANGES.rst

scrapy_zyte_api/responses.py

tests/test_responses.py

…lace()

Gallaecio

Great work!

kmike · 2022-05-30T11:43:53Z

🎉 thanks @BurnzZ! 🚀

create ZyteAPITextResponse and ZyteAPIResponse

9a83471

BurnzZ requested a review from kmike April 26, 2022 13:07

update README and CHANGES with notes on new response classes

8909473