
Support for list Field Coverage #391

Merged (10 commits) on Apr 26, 2023

Conversation

mrwbarg
Contributor

@mrwbarg mrwbarg commented Feb 28, 2023

Closes #390.

When a field scraped by a spider is a list containing objects, there is currently no way to set coverage thresholds for the fields inside those objects. This PR adds support for correctly counting and calculating coverage for those types of fields, both at the top level of the item and inside nested structures.
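As a rough illustration of the described behavior (a minimal sketch, not the actual patch; the field-path naming and function name here are invented for the example):

```python
from collections import Counter

def count_field_coverage(item, prefix="", counter=None):
    """Count how often each field path appears in an item, descending
    into nested dicts and, as this PR adds, into lists of dict objects."""
    if counter is None:
        counter = Counter()
    for key, value in item.items():
        path = f"{prefix}/{key}" if prefix else key
        counter[path] += 1
        if isinstance(value, dict):
            count_field_coverage(value, path, counter)
        elif isinstance(value, list):
            # New behavior: traverse each dict element of a list field
            # so its inner fields are counted too.
            for element in value:
                if isinstance(element, dict):
                    count_field_coverage(element, path, counter)
    return counter
```

For an item like `{"name": "A", "offers": [{"price": 1}, {"price": 2, "currency": "USD"}]}`, this counts `offers` once but `offers/price` twice, so a threshold can be expressed against the list elements rather than only the list field itself.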

@codecov

codecov bot commented Feb 28, 2023

Codecov Report

Patch coverage: 95.23% and project coverage change: +0.09% 🎉

Comparison is base (0cf783f) 76.44% compared to head (2d4489d) 76.54%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
+ Coverage   76.44%   76.54%   +0.09%     
==========================================
  Files          76       76              
  Lines        3197     3214      +17     
  Branches      379      384       +5     
==========================================
+ Hits         2444     2460      +16     
  Misses        683      683              
- Partials       70       71       +1     
Impacted Files                                   Coverage Δ
spidermon/contrib/scrapy/monitors/monitors.py    97.89% <ø> (ø)
spidermon/contrib/scrapy/extensions.py           85.57% <90.00%> (+0.16%) ⬆️
spidermon/utils/field_coverage.py                100.00% <100.00%> (ø)


Collaborator

@VMRuiz VMRuiz left a comment


I think this feature is interesting, but I'm concerned about the impact that iterating over each array of an item might have on performance. Have you tested this patch by running some Scrapy jobs with complex or larger items to see if they perform well compared to the basic version?

@VMRuiz VMRuiz requested review from rennerocha and Gallaecio March 1, 2023 08:39
mauricio.barg added 2 commits March 1, 2023 17:09
@mrwbarg
Contributor Author

mrwbarg commented Mar 2, 2023

> I think this feature is interesting, but I'm concerned about the impact that iterating over each array of an item might have on performance. Have you tested this patch by running some Scrapy jobs with complex or larger items to see if they perform well compared to the basic version?

I haven't run any test jobs, but this will definitely impact performance. For example, if we assume every item in a job has one field that is a list of m objects, and the job scraped n items, the overall complexity would be O(mn), whereas previously it was just O(n). If m is big enough, the impact will definitely be felt. Of course, this gets worse if there are more lists nested inside the objects in the topmost list (though that can be avoided).

All the ways I can think of doing this (using a Counter, for example) would still require the entire list to be traversed for each yielded item. Do you have any implementation suggestions that might mitigate this issue?

Maybe we can have it as a setting and inform the user of the possible performance impact?
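The mitigation alluded to here, limiting how many levels of nested lists are traversed, could look roughly like this (an illustrative sketch; `max_depth` is a hypothetical knob, not necessarily the name used by the actual patch):

```python
from collections import Counter

def count_coverage(item, counter=None, prefix="", depth=0, max_depth=1):
    # Descend into lists of dicts only while depth < max_depth, which
    # bounds the extra per-item work even for deeply nested structures.
    if counter is None:
        counter = Counter()
    for key, value in item.items():
        path = f"{prefix}/{key}" if prefix else key
        counter[path] += 1
        if isinstance(value, dict):
            count_coverage(value, counter, path, depth, max_depth)
        elif isinstance(value, list) and depth < max_depth:
            for element in value:
                if isinstance(element, dict):
                    count_coverage(element, counter, path, depth + 1, max_depth)
    return counter
```

With `max_depth=1`, a list nested inside another list's elements is never traversed, so the worst-case cost stays proportional to the size of the top-level lists only.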

@mrwbarg
Contributor Author

mrwbarg commented Mar 10, 2023

Updated: it is now enabled through a setting, and the number of coverage nesting levels can be configured.
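In a project's settings this would be enabled roughly as follows (a hedged sketch: `SPIDERMON_ADD_FIELD_COVERAGE` is an existing Spidermon setting, but treat the list-coverage setting name and value here as illustrative rather than authoritative; the docs added in this PR are the source of truth):

```python
# settings.py -- illustrative fragment only
SPIDERMON_ENABLED = True
SPIDERMON_ADD_FIELD_COVERAGE = True

# Opt-in coverage of fields inside lists of dicts; disabled by default
# because of the potential performance impact discussed in this PR.
# Setting name assumed for illustration; the value is the nesting depth.
SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 1
```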

@Gallaecio
Member

If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.

@mrwbarg
Contributor Author

mrwbarg commented Mar 22, 2023

> If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.

yup, you're correct

@curita
Member

curita commented Apr 24, 2023

Hi all! Just to check: is anything blocking this PR from getting approved? There's a comment in the docs from this PR about the performance impact of enabling this setting, so I think that concern is covered now.

Member

@Gallaecio Gallaecio left a comment


✔️ from me, I did not realize the new docs already mentioned performance.

@VMRuiz VMRuiz merged commit 44d5316 into scrapinghub:master Apr 26, 2023
Development

Successfully merging this pull request may close these issues.

Add support for lists of dictionaries in field coverage rules
4 participants