
Support for list Field Coverage #391

Merged (10 commits) on Apr 26, 2023

Conversation

mrwbarg
Contributor

@mrwbarg mrwbarg commented Feb 28, 2023

Closes #390.

When a field scraped by a spider is a list containing objects, there is currently no way to set coverage thresholds for the fields inside those objects. This PR adds support for correctly counting and calculating coverage for those types of fields, both at the top level of the item and inside nested structures.
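As a rough illustration of the described behavior (a minimal sketch, not the actual patch; the field-path naming and function name here are invented for the example):

```python
from collections import Counter

def count_field_coverage(item, prefix="", counter=None):
    """Count how often each field path appears in an item, descending
    into nested dicts and, as this PR adds, into lists of dict objects."""
    if counter is None:
        counter = Counter()
    for key, value in item.items():
        path = f"{prefix}/{key}" if prefix else key
        counter[path] += 1
        if isinstance(value, dict):
            count_field_coverage(value, path, counter)
        elif isinstance(value, list):
            # New behavior: traverse each dict element of a list field
            # so its inner fields are counted too.
            for element in value:
                if isinstance(element, dict):
                    count_field_coverage(element, path, counter)
    return counter
```

For an item like `{"name": "A", "offers": [{"price": 1}, {"price": 2, "currency": "USD"}]}`, this counts `offers` once but `offers/price` twice, so a threshold can be expressed against the list elements rather than only the list field itself.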

@codecov

codecov bot commented Feb 28, 2023

Codecov Report

Patch coverage: 95.23% and project coverage change: +0.09% 🎉

Comparison is base (0cf783f) 76.44% compared to head (2d4489d) 76.54%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
+ Coverage   76.44%   76.54%   +0.09%     
==========================================
  Files          76       76              
  Lines        3197     3214      +17     
  Branches      379      384       +5     
==========================================
+ Hits         2444     2460      +16     
  Misses        683      683              
- Partials       70       71       +1     
Impacted Files                                   Coverage Δ
spidermon/contrib/scrapy/monitors/monitors.py    97.89% <ø> (ø)
spidermon/contrib/scrapy/extensions.py           85.57% <90.00%> (+0.16%) ⬆️
spidermon/utils/field_coverage.py                100.00% <100.00%> (ø)


Collaborator

@VMRuiz VMRuiz left a comment


I think this feature is interesting, but I'm concerned about the impact that iterating over each array of an item might have on performance. Have you tested this patch by running some Scrapy jobs with complex or larger items to see if they perform well compared to the basic version?

@VMRuiz VMRuiz requested review from rennerocha and Gallaecio March 1, 2023 08:39
mauricio.barg added 2 commits March 1, 2023 17:09
@mrwbarg
Contributor Author

mrwbarg commented Mar 2, 2023

> I think this feature is interesting, but I'm concerned about the impact that iterating over each array of an item might have on performance. Have you tested this patch by running some Scrapy jobs with complex or larger items to see if they perform well compared to the basic version?

I haven't run any test jobs, but this will definitely impact performance. For example, if we assume every item in a job has one field that is a list of m objects, and the job scraped n items, the overall complexity would be O(mn), whereas previously it was just O(n). If m is big enough, the impact will definitely be felt. Of course, this gets worse if there are more lists nested inside the objects in the topmost list (though that can be avoided).

All the ways I can think of doing this (using a Counter, for example) would still require the entire list to be traversed for each yielded item. Do you have any implementation suggestions that might mitigate this issue?

Maybe we can have it as a setting and inform the user of the possible performance impact?
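The mitigation alluded to here, limiting how many levels of nested lists are traversed, could look roughly like this (an illustrative sketch; `max_depth` is a hypothetical knob, not necessarily the name used by the actual patch):

```python
from collections import Counter

def count_coverage(item, counter=None, prefix="", depth=0, max_depth=1):
    # Descend into lists of dicts only while depth < max_depth, which
    # bounds the extra per-item work even for deeply nested structures.
    if counter is None:
        counter = Counter()
    for key, value in item.items():
        path = f"{prefix}/{key}" if prefix else key
        counter[path] += 1
        if isinstance(value, dict):
            count_coverage(value, counter, path, depth, max_depth)
        elif isinstance(value, list) and depth < max_depth:
            for element in value:
                if isinstance(element, dict):
                    count_coverage(element, counter, path, depth + 1, max_depth)
    return counter
```

With `max_depth=1`, a list nested inside another list's elements is never traversed, so the worst-case cost stays proportional to the size of the top-level lists only.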

@mrwbarg
Contributor Author

mrwbarg commented Mar 10, 2023

Updated: it is now enabled through a setting, and the number of coverage nesting levels can be configured.
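In a project's settings this would be enabled roughly as follows (a hedged sketch: `SPIDERMON_ADD_FIELD_COVERAGE` is an existing Spidermon setting, but treat the list-coverage setting name and value here as illustrative rather than authoritative; the docs added in this PR are the source of truth):

```python
# settings.py -- illustrative fragment only
SPIDERMON_ENABLED = True
SPIDERMON_ADD_FIELD_COVERAGE = True

# Opt-in coverage of fields inside lists of dicts; disabled by default
# because of the potential performance impact discussed in this PR.
# Setting name assumed for illustration; the value is the nesting depth.
SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 1
```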

@Gallaecio
Member

If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.

@mrwbarg
Contributor Author

mrwbarg commented Mar 22, 2023

> If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.

yup, you're correct

@curita
Member

curita commented Apr 24, 2023

Hi all! Just to check: is anything blocking this PR from getting approved? There's a comment in the docs from this PR about the performance impact of enabling this setting, so I think that concern is covered now.

Member

@Gallaecio Gallaecio left a comment


✔️ from me, I did not realize the new docs already mentioned performance.

@VMRuiz VMRuiz merged commit 44d5316 into scrapinghub:master Apr 26, 2023
Development

Successfully merging this pull request may close these issues.

Add support for lists of dictionaries in field coverage rules
4 participants