Conversation

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
```python
def tearDown(self):
    shutil.rmtree(self.tmpdirname)

def prepare_image_inputs(self):
```
@ydshieh would be great to get your first feedback on this PR (and potentially your help to extend this to all processors).
I think we may need to create separate ImageTextProcessorTesterMixin and AudioTextProcessorTesterMixin classes, given that the former uses an image processor + tokenizer, whereas the latter uses a feature extractor + tokenizer.
I am not in favor of creating 2 classes. Define prepare_inputs(input_type), where input_type could be text, image or audio, and have the method return what is requested.
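A minimal sketch of this suggestion (method and modality names are illustrative, not the library's API): a single entry point that returns dummy inputs for whichever modality a given processor test needs.

```python
import random

class ProcessorTesterMixin:
    def prepare_inputs(self, input_type):
        # One method covering all modalities, instead of a separate
        # prepare_image_inputs / prepare_audio_inputs per tester class.
        if input_type == "text":
            return ["lower newer", "upper older"]
        if input_type == "image":
            # Two tiny "images" as nested lists (height x width x RGB values).
            return [
                [[[random.randint(0, 255) for _ in range(3)] for _ in range(4)]
                 for _ in range(3)]
                for _ in range(2)
            ]
        if input_type == "audio":
            # Two short mono waveforms as float lists.
            return [[random.random() for _ in range(160)] for _ in range(2)]
        raise ValueError(f"Unknown input_type: {input_type!r}")
```

A text+vision tester would then call `self.prepare_inputs("text")` and `self.prepare_inputs("image")`, while an audio+text tester calls the same method with `"audio"`.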
ydshieh left a comment
Leaving some comments regarding a few re-design suggestions.
```python
tokenizer_class = None
fast_tokenizer_class = None
image_processor_class = None
```
I think we could avoid using these attributes. With a given processor_class (one line below), we can already access image_processor_class, tokenizer_class or feature_extractor_class.
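For illustration, a self-contained sketch of this idea using a stand-in processor class (real processors in the library declare similar class-level attributes; the names here are stubs, not the actual classes):

```python
class FakeProcessor:
    # Stand-in mirroring how a library processor declares its components.
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "FakeImageProcessor"
    tokenizer_class = "FakeTokenizer"

class ProcessorTesterMixin:
    processor_class = FakeProcessor

    def component_class_names(self):
        # Read the component class names off processor_class itself, so the
        # tester needs no tokenizer_class / image_processor_class of its own.
        return {
            attr: getattr(self.processor_class, f"{attr}_class")
            for attr in self.processor_class.attributes
        }

tester = ProcessorTesterMixin()
print(tester.component_class_names())
# {'image_processor': 'FakeImageProcessor', 'tokenizer': 'FakeTokenizer'}
```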
```python
def get_tokenizer(self, **kwargs):
    return self.tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
```
We have to consider the case where a processor may use only the fast (or only the slow) tokenizer.
```python
def get_fast_tokenizer(self, **kwargs):
    return self.fast_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)

def get_image_processor(self, **kwargs):
```
It might be good to just have get_components, inside which we prepare all components of a processor.
This way, we don't have to define something like get_image_processor, which is not necessarily applicable to all the processor classes in the library.
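A possible sketch of get_components, using stub component classes (illustrative stand-ins; real testers would use the actual library classes and their from_pretrained):

```python
import shutil
import tempfile

class FakeTokenizer:
    @classmethod
    def from_pretrained(cls, path, **kwargs):
        return cls()

class FakeImageProcessor:
    @classmethod
    def from_pretrained(cls, path, **kwargs):
        return cls()

class FakeProcessor:
    # Mirrors how a library processor declares its components.
    attributes = ["tokenizer", "image_processor"]
    tokenizer_class = FakeTokenizer
    image_processor_class = FakeImageProcessor

class ProcessorTesterMixin:
    processor_class = FakeProcessor

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def get_components(self, **kwargs):
        # One getter for all components, keyed by attribute name, instead of
        # a dedicated get_tokenizer / get_image_processor per modality.
        return {
            attr: getattr(self.processor_class, f"{attr}_class").from_pretrained(
                self.tmpdirname, **kwargs
            )
            for attr in self.processor_class.attributes
        }
```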
```python
tokenizer_slow = self.get_tokenizer()
tokenizer_fast = self.get_fast_tokenizer()
image_processor = self.get_image_processor()
```
We could use get_components (suggested in the comment above) to get all the components needed to instantiate a processor.
The test could then be rewritten to handle all possible component types rather than just tokenizer + image processor as below.
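A self-contained sketch of that rewrite, with stub classes and a hypothetical get_components helper (names are illustrative): the processor is built from whatever components the tester prepares, so the same test works for any component combination.

```python
class FakeTokenizer:
    pass

class FakeImageProcessor:
    pass

class FakeProcessor:
    attributes = ["tokenizer", "image_processor"]

    def __init__(self, tokenizer=None, image_processor=None):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

class Tester:
    processor_class = FakeProcessor

    def get_components(self):
        # Hypothetical helper: one instance per component the processor declares.
        return {"tokenizer": FakeTokenizer(), "image_processor": FakeImageProcessor()}

tester = Tester()
# Unpack the component dict instead of hard-coding tokenizer + image processor.
processor = tester.processor_class(**tester.get_components())
```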
```python
processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)

predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
```
This expected value will only work for a particular class (or a few classes).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@NielsRogge If you are OK with that, let's put the tests in another PR #27761 and close this one?
Isn't the other PR for saving processors? Or will it also add common tests?
Yes, that is a PR about saving processors. But we want to have tests to make sure it works as we want it (to avoid later changes of files on the Hub, which is almost always impossible).
Yes, probably in a different version than this PR.
What does this PR do?
Multimodal processors currently don't have common tests. This PR aims to work towards having a common API for our multimodal processors, making sure they all have the same inputs and outputs (e.g. making sure text+vision processors accept `text` as the first kwarg, then `images`, etc.). As a first step, I refactor the CLIP and BLIP-2 processor tests to leverage the common ones.