Speed regression in ItemLoader since 1.0 #35

juraseg · 2018-02-15T11:39:11Z

Hi

TLDR: ItemLoader does not reuse response.selector when you are passing response to it as argument. And looks like it was reusing it up to Scrapy==1.0

Recently we were trying to upgrade our setup to use newer version of Scrapy (we are on 1.0.5).
We noticed huge slowdown in a lot of our spiders.

I dag down a bit and found that the bottleneck was ItemLoader, in particular when you create many ItemLoaders (one for each Item) passing response to every one of them and the response size is relatively big.

The speed drop down goes back to version 1.1. So on Scrapy==1.1 and above the performance degradation is present.

Test spider code (testfile is just a random file of 1MB size):

# -*- coding: utf-8 -*-
import os.path

import scrapy
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field

HERE = os.path.abspath(os.path.dirname(__file__))


class Product(Item):
    url = Field()
    name = Field()


class TestSpeedSpider(scrapy.Spider):
    name = 'test_speed'

    def start_requests(self):
        file_path = HERE + '/testfile'

        file_url = 'file://' + file_path
        yield scrapy.Request(file_url)

    def parse(self, response):
        for i in xrange(0, 300):
            loader = ItemLoader(item=Product(), response=response)
            loader.add_value('name', 'item {}'.format(i))
            loader.add_value('url', 'http://site.com/item{}'.format(i))

            product = loader.load_item()

            yield product

The text was updated successfully, but these errors were encountered:

ghost · 2018-02-16T13:33:30Z

Hi @juraseg - thanks for raising this! I'm looking back over the history to get familiar with how these objects were instantiated before, and now. Lots of ground to cover. But, it looks like you could deliberately provide a selector instance to the constructor for ItemLoader, and sidestep the issue entirely. Is this something you've explored?

ghost · 2018-02-16T14:56:07Z

As an experiment, I added a line to the ItemLoader constructor that would check response._cached_selector is not None and would use it if present. On Py2.7 and Py3.5 locally, it seems to pass tests well enough. So, this ought to be fixable, but we should probably discuss where is the most appropriate place to re-use selectors in this way. If implemented incautiously, then it might clobber custom selector subclasses silently and unexpectedly.

kmike · 2018-02-16T15:15:04Z

I believe regression happened when lxml document cache was removed, as a part of scrapy/scrapy#1409. The idea was to use response.selector explicitly; we fixed it for link extractors (https://github.com/scrapy/scrapy/pull/1409/files#diff-952f1c77a1cfacb08eed58f08daa8870R67), but not for item loaders.

Item loaders are a bit different because they alow to customize selector class, this is where we should be careful implementing a fix.

Accessing response._cached_selector is not None doesn't look clean as a final solution, as we'll be accessing a private attribute. Even if there is no cached selector yet, it makes sense to use response.selector (which caches transparently) - this way if some further code needs a selector, body won't be parsed one more time.

juraseg · 2018-02-16T16:07:24Z

Hi guys

@cathalgarvey Yes, I tried providing selector instead of response and it does the job. So really a matter of reusing selector from response.

If that helps, I did what @kmike suggests for our case to solve the issue - created custom loader, which checks if there is response.selector. To workaround the issue with custom selector this loader checks if response.selector.__class__ is equal to loader's default_selector_class.

The class check looks a bit clunky, but I have not found other way to check (isinstance won't work as it will return True for subclasses as well, which is not what we need here).

kmike · 2018-02-16T18:20:29Z

@juraseg I think your approach is fine, but there is an extra tricky part: when you access response.selector.__class__, response body could be parsed in order to fill the "selector" attribute for the first time. If a class is not the same as ItemLoader's default_selector_class, response is parsed again. The second one is inevitable, but parsing response with a wrong selector just to get its type looks unnecessary.

juraseg · 2018-02-17T06:37:21Z

@kmike
Ah, I see what you mean.

So the only way to avoid second parsing is to know which class the would be response.selector is and compare to it instead of accessing response.selector. For example response object could have selector_class read-only property.

malloxpb · 2018-03-08T07:29:13Z

Hey guys, I tried implementing the modification to ItemLoader constructor. So when there is response.selector available, we will use that instead of creating a new Selector. The result for my sample code has shown that the performance is equal to that of Scrapy 1.0. However, it seems like the discussion is still continuing so I am just wondering if you guys can tell me more about the potential outcome if I do something shown on this diff scrapy/scrapy#3157

Gallaecio transferred this issue from scrapy/scrapy Oct 30, 2020

Gallaecio added the enhancement New feature or request label Feb 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed regression in ItemLoader since 1.0 #35

Speed regression in ItemLoader since 1.0 #35

juraseg commented Feb 15, 2018

ghost commented Feb 16, 2018

ghost commented Feb 16, 2018

kmike commented Feb 16, 2018

juraseg commented Feb 16, 2018

kmike commented Feb 16, 2018 •

edited

Loading

juraseg commented Feb 17, 2018

malloxpb commented Mar 8, 2018 •

edited

Loading

Speed regression in ItemLoader since 1.0 #35

Speed regression in ItemLoader since 1.0 #35

Comments

juraseg commented Feb 15, 2018

ghost commented Feb 16, 2018

ghost commented Feb 16, 2018

kmike commented Feb 16, 2018

juraseg commented Feb 16, 2018

kmike commented Feb 16, 2018 • edited Loading

juraseg commented Feb 17, 2018

malloxpb commented Mar 8, 2018 • edited Loading

kmike commented Feb 16, 2018 •

edited

Loading

malloxpb commented Mar 8, 2018 •

edited

Loading