Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend ItemLoader processors #31

Open
Matthijsy opened this issue Jan 11, 2019 · 6 comments
Open

Extend ItemLoader processors #31

Matthijsy opened this issue Jan 11, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@Matthijsy
Copy link
Contributor

Currently there are three methods to add ItemLoader processor:

  • The default_input/output_processor on the ItemLoader class
  • The field_name_in/out on the ItemLoader class
  • The input/output_processor on the scrapy.Field

Personally I use the input/output_processor on the scrapy.Field combined with the default_input/output_processor a lot. But I use those in combination. Often I just want to add one more processor after the default processors. Since input/output_processor on scrapy.Field does a override of the defaults this is quite hard to do.
So I would propose to add another method to add a input/output processors. I would like to have something like add_input/output on the scrapy.Field, which would add the specified processor to the default processor.

I did implement this on my own ItemLoader class but think that it would be usefull for the scrapy core. My implementation is as follows (original source: https://github.com/scrapy/scrapy/blob/master/scrapy/loader/__init__.py#L69). Ofcourse this can be added to get_output_processor in the same way.

def get_input_processor(self, field_name):
        proc = getattr(self, '%s_in' % field_name, None)
        if not proc:
            override_proc = self._get_item_field_attr(field_name, 'input_processor')
            extend_proc = self._get_item_field_attr(field_name, 'add_input')
            if override_proc and extend_proc:
                raise ValueError(f'Not allowed to define input_processor and add_input to {field_name}')
            if override_proc:
                return override_proc
            elif extend_proc:
                return Compose(self.default_input_processor, extend_proc)
            return self.default_input_processor
        return proc

I am not sure if add_input is a good name, probably extend_input_processor is more clear but this quite a long name. I would like to hear if more people are wanting this feature and what you all think about what the naming should be.

@Gallaecio
Copy link
Member

I am having a hard time trying to picture what you want done, even after reading the documentation from your pull request.

Could you provide some sample code that shows the (convoluted) way to achieve your goal before your changes, and the (simpler) way to do the same after your suggested changes?

@Matthijsy
Copy link
Contributor Author

Matthijsy commented Mar 26, 2019

We have a custom itemloader which specifies some default input processors. Some fields needs an additional input processor which is specific for this field. Now we need to copy the default input processor and add the new one. We we want to add a new default input processor in the future we have to do this at all those fields again. I would like to be able to keep the default input processor and only extend it with one new input processor.

Example:
We have as default input processor a strip and remove tags function to clean some data.
Now we have a scrapy item representing an person with two fields:

  • Name -> for this field those input processors are fine
  • Height -> This data is in the format 1.50m for example but we want to have it in cm as integer (150). So we want to add an new input processor to do this, but keep the strip and remove_tags input processors

I hope this makes the use case a bit more clear

@Gallaecio
Copy link
Member

So, if I got it right, you are saying that you have:

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor
    some_field_in = MapCompose(some_input_processor, another_input_processor)

How would you like that to look like instead?

@Matthijsy
Copy link
Contributor Author

No we don't use the some_field_in method, we use it like this (current style)

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(input_processor=MapCompose(some_input_processor, another_input_processor))

But I would like to do it in this style

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(add_input=another_input_processor)

This way I don't have to duplicate the some_input_processor

@Matthijsy
Copy link
Contributor Author

@Gallaecio Do you understand the problem now? Or do I need to explain more?

@Gallaecio
Copy link
Member

Gallaecio commented Jun 4, 2019

I think I understand, although I personally don’t like coupling Item’s Field class and the ItemLoader class that way.

@Gallaecio Gallaecio transferred this issue from scrapy/scrapy Oct 30, 2020
@Gallaecio Gallaecio added the enhancement New feature or request label Feb 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants