Add DetrImageProcessorFast (#34063)
Conversation
molbap
left a comment
Very nice work! I added a couple of comments, in addition to one on the page you linked related to the performance of .to(device).to(dtype) vs .to(device, dtype), finding the second one to be always faster when moving to a CUDA device, fwiw
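To illustrate the two forms (a minimal sketch assuming only that torch is installed; the speed difference shows up on CUDA, but the two calls are semantically equivalent and can be checked on CPU):

```python
import torch

x = torch.randint(0, 256, (3, 800, 1333), dtype=torch.uint8)

# Chained form: two separate .to() calls (device move, then dtype cast)
chained = x.to("cpu").to(torch.float32)

# Fused form: one call handling device and dtype together; this is the
# variant reported to be faster when the target device is CUDA
fused = x.to("cpu", torch.float32)

assert fused.dtype == torch.float32
assert torch.equal(chained, fused)
```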
This method doesn't have any tensor <-> array change, right?
# Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio
Yes indeed. I'm not sure if I should import those functions directly from image_processing_detr.py instead of using # Copied from at this point, wdyt?
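For reference, a pure-Python sketch of the logic get_size_with_aspect_ratio implements (scale the shorter edge to size, shrinking size first if the longer edge would exceed max_size); descriptive names are used here instead of the original's oh/ow:

```python
def get_size_with_aspect_ratio(image_size, size, max_size=None):
    """Sketch of the DETR resize-size computation. Returns (height, width)."""
    height, width = image_size
    if max_size is not None:
        min_original = float(min(image_size))
        max_original = float(max(image_size))
        # Shrink the target if the scaled longer edge would exceed max_size
        if max_original / min_original * size > max_size:
            size = int(round(max_size * min_original / max_original))
    # Shorter edge already matches the target: nothing to do
    if (height <= width and height == size) or (width <= height and width == size):
        return height, width
    if width < height:
        new_width = size
        new_height = int(size * height / width)
    else:
        new_height = size
        new_width = int(size * width / height)
    return new_height, new_width
```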
We typically don't import from another model's file, but it'd be nice to do it, yes - in general we should be able to import the more generic methods from an ...utils file (say processing_utils.py, image_processing_utils.py, image_processing_utils_fast.py) instead of redefining them for each model. cc @ArthurZucker for this pattern, as we discussed it before
Yep I think it's good to think about how we can refactor a bit!
and btw (for later) these methods should get a slight 🧹 🧹 regarding the variable naming for oh + ow
we can use the deprecate_kwarg util here
I agree, but using it causes test_image_processor_preprocess_arguments to fail. I'm guessing this test does not play well with the way the decorator calls the function and passes the arguments
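For context, a simplified sketch of what a deprecate_kwarg-style util typically does (a hypothetical re-implementation, not the actual transformers helper): pop the old kwarg, warn, and forward it under the new name. Because functools.wraps records __wrapped__, inspect.signature still resolves to the undecorated function, which is the usual fix when a signature-inspecting test breaks on decorated methods.

```python
import functools
import warnings


def deprecate_kwarg(old_name, new_name, version):
    """Hypothetical sketch: route a deprecated kwarg to its new name."""
    def decorator(func):
        @functools.wraps(func)  # preserves name, docstring, and __wrapped__
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                warnings.warn(
                    f"`{old_name}` is deprecated and will be removed in "
                    f"v{version}; use `{new_name}` instead.",
                    FutureWarning,
                )
                kwargs[new_name] = kwargs.pop(old_name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Illustrative use on a toy function (version number is made up)
@deprecate_kwarg("max_size", new_name="size", version="4.48")
def resize(image, size=None):
    return size
```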
Dove in a bit here - nice refactor and improvement. I think if we can move to a more modular way of building processors, as we try to do with models, it'll reduce the LOC count greatly and improve readability. A modular_image_processing_detr_fast that inherits the necessary funcs and then automatically builds image_processing_detr_fast would be amazing, for instance. cc @ArthurZucker for visibility (not a requirement for this PR imo).
Yes, I was thinking about this too, because the LOC count is huge compared to the actual changes. I might be wrong, but it looks like something similar was done to reduce the LOC count for fast tokenizers? Not sure who to ping here for clarification
Yeah! Kind of like we have a PreTrainedModel, we can have an ImageProcessorMixin that would be a bit more powerful!
For fast tokenizers, we just rely on the tokenizers library directly, so it's a bit different
It's interesting that the git diff is so large here - unless I'm wrong, we are generalizing the tests to iterate over a list of processors, fast and slow, but the tests themselves do not change?
Yes, I'm guessing it's because of the change in indentation - all I changed is testing a list of processors instead of only one
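A minimal sketch of that pattern with dummy stand-ins (not the real test classes): the assertions are unchanged, but looping over a list of processor classes re-indents every line of the test body, which inflates the diff.

```python
import unittest


class DummySlowProcessor:
    """Stand-in for a slow (numpy-based) image processor."""
    def __call__(self, images):
        return {"pixel_values": [list(image) for image in images]}


class DummyFastProcessor(DummySlowProcessor):
    """Stand-in for a fast (tensor-based) image processor."""


class ImageProcessingTest(unittest.TestCase):
    # Previously a single class; now every test iterates over this list
    image_processor_list = [DummySlowProcessor, DummyFastProcessor]

    def test_call(self):
        for image_processing_class in self.image_processor_list:
            image_processor = image_processing_class()
            # Same assertions as before, just one indentation level deeper
            out = image_processor([(1, 2), (3, 4)])
            self.assertEqual(out["pixel_values"], [[1, 2], [3, 4]])
```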
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker
left a comment
LGTM in general, but having 1600 LOC for an image processor is quite intense!
Great performance gains, though!
This is kind of devoid of # Copied from, which I am a bit surprised about!
Most methods are very similar to the ones in image_processing_detr, but slightly changed to work with tensors instead of numpy arrays. I haven't changed the post-processing methods at all (as they are not as much of a speed bottleneck), so they are all added with # Copied from (at the end of the file)
* add fully functionning image_processing_detr_fast
* Create tensors on the correct device
* fix copies
* fix doc
* add tests equivalence cpu gpu
* fix doc en
* add relative imports and copied from
* Fix copies and nit
@yonigozlan Hi, super excited to see this. Just left two questions
Hi @SangbumChoi !
@yonigozlan thanks! I would love to support this task if you need some help. Also, if you can share the comparison script for compile + visualizing plots etc., it would be very helpful!
What does this PR do?
Adds a fast image processor for DETR. Follows issue #33810.
This image processor is the result of this work on comparing different image processing methods.
The processing methods use only torchvision transforms (either v1 or v2, depending on the torchvision version) and torch tensors.
Just like the current DETR image processor, this processor can also process object detection or segmentation annotations. This processing also uses only torch tensors and torchvision transforms.
The post-processing methods have not been modified from the original image processor.
Implementation
A previous fast image processor implementation for ViT (link) uses torchvision transform classes and Compose to create a one-step processing class. However, this approach poses two problems, so this implementation uses the functional forms of torchvision transforms instead, and its structure is very similar to the current DETR image processor.
All the numpy/PIL operations have been converted to torch or torchvision operations, and like the ViT fast image processor, this processor only accepts return_tensors="pt".
The processor call function accepts a device kwarg, as processing can be performed on both CPU and GPU, but is much faster on GPU. I wanted to add device as an init argument, but that would make the signatures of the fast and slow processors differ, which makes some tests fail.
Usage
Except for the fact that it only returns torch tensors, this fast processor is fully compatible with the current one.
It can be instantiated through AutoImageProcessor with use_fast=True, or through the class directly:
Usage is the same as the current processor, except for the device kwarg:
If device is not specified:
Performance gains
Main Takeaways
Processing speedup
Inference pass speedup (GPU)
batch_size=8. Padding needed, as the images have different sizes and the DETR processor resizes them using "shortest_edge"/"longest_edge", resulting in differently sized resized images.
batch_size=8. Forcing padding to 1333x1333 (= "longest_edge"), as otherwise torch.compile needs to recompile when different batches have different max sizes. (I'm not sure what is going wrong when using the compiled model with the current processor.)
batch_size=1. Forcing padding to 1333x1333 for comparison with batched inputs.
Tests
Looking forward to your feedback!
I was also wondering if we should adopt a more modular approach to the fast image processors, as there is quite a lot of repetition with the "slow" processor for now. It looks like something similar was done for fast tokenizers? If anyone who worked on fast tokenizers has advice on that, I'll gladly hear it 🤗.
There is also the question of how to advertise this "use_fast" option to users, and whether we want to make it the default eventually when torchvision is available.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.