Merged

59 commits
b64d760
init refactor
yonigozlan Jan 27, 2026
43ce3c9
Fix llava
yonigozlan Jan 27, 2026
75a5f8b
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Jan 27, 2026
94e2f2c
changes after review
yonigozlan Feb 2, 2026
34aa802
update first batch of image processors
yonigozlan Feb 5, 2026
21ff306
refactor part 2
yonigozlan Feb 11, 2026
de23b46
improve base image processor class, move backends to separate file
yonigozlan Feb 11, 2026
1d64b21
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Feb 11, 2026
56c49c8
refactor to have backends in separate files, with backends now inheri…
yonigozlan Feb 15, 2026
ad91d8e
fix docstrings
yonigozlan Feb 17, 2026
87dc38b
update some image processors to new refactored standards
yonigozlan Feb 17, 2026
780c0cb
refactor more image processors
yonigozlan Feb 18, 2026
526a668
refactor more image processors
yonigozlan Feb 18, 2026
85207f2
refactor more fast image processors
yonigozlan Feb 20, 2026
e1c56dc
refactor more image processors
yonigozlan Feb 20, 2026
be2d888
refactor more image processor
yonigozlan Feb 24, 2026
ed57337
improve compatibility with video processors
yonigozlan Feb 24, 2026
fbf6be5
refactor more image processors
yonigozlan Feb 25, 2026
6efda3a
add more image processors, improve compatibility with video processors
yonigozlan Feb 25, 2026
97f0dd9
support for modular
yonigozlan Feb 25, 2026
1da0c11
refactor modular ima proc
yonigozlan Feb 26, 2026
f27bce9
refactor more modular image processors
yonigozlan Feb 26, 2026
5135b2b
adjustments before merge
yonigozlan Feb 27, 2026
5641c8e
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Feb 27, 2026
71dd63f
fimish image processors refactor
yonigozlan Mar 3, 2026
6a738de
update docs
yonigozlan Mar 3, 2026
f652f5a
add fallback to Pil backend for backward compat
yonigozlan Mar 3, 2026
80bcfe0
fix repo
yonigozlan Mar 3, 2026
d13f843
Fix all processors and image processors tests
yonigozlan Mar 3, 2026
7d52916
fix modular and style
yonigozlan Mar 3, 2026
6d46b06
fix docs
yonigozlan Mar 3, 2026
0f918df
fix remote code backward compatibility + super in lists
yonigozlan Mar 4, 2026
5b8155d
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Mar 4, 2026
7502afb
Update docs and add new model like cli
yonigozlan Mar 4, 2026
bd4ccc6
fix processor tests
yonigozlan Mar 4, 2026
db34bb3
relax test tvp (used to be skipped)
yonigozlan Mar 4, 2026
3e0a72b
fix 4 channels oneformer
yonigozlan Mar 4, 2026
f5bbf0e
Changes after review
yonigozlan Mar 16, 2026
4a33f7f
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Mar 16, 2026
ae06795
Fixes after review
yonigozlan Mar 16, 2026
a6d7002
Fix tests
yonigozlan Mar 16, 2026
04ef71f
Change imports in modeling tests to minimize integration tests changes
yonigozlan Mar 16, 2026
9b1d3a8
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Mar 16, 2026
ed5a4df
fix wrong import
yonigozlan Mar 16, 2026
304e4b0
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Mar 17, 2026
f42220e
fix import and missing doc
yonigozlan Mar 17, 2026
22037d2
fix typo PI0
yonigozlan Mar 17, 2026
272fdca
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Mar 17, 2026
777551d
Fix all integration tests
yonigozlan Mar 17, 2026
3b9b589
Fix after review, enforce protected torch/torchvision imports in pil …
yonigozlan Mar 18, 2026
b12a20d
Fix style
yonigozlan Mar 18, 2026
9fc7900
Fix test modeling depth pro
yonigozlan Mar 18, 2026
01fb2f1
Fix processing_idefics
yonigozlan Mar 18, 2026
fcd2335
Merge branch 'main' into refactor-improc-backends
yonigozlan Mar 18, 2026
fd54fd7
Merge remote-tracking branch 'upstream/main' into refactor-improc-bac…
yonigozlan Mar 19, 2026
a094cef
Fixes after merge
yonigozlan Mar 19, 2026
33dfa8c
_rescale_and_normalize -> rescale_and_normalize
yonigozlan Mar 19, 2026
8c13796
fix-repo
yonigozlan Mar 19, 2026
7b8d547
Merge branch 'main' into refactor-improc-backends
yonigozlan Mar 19, 2026
14 changes: 8 additions & 6 deletions CONTRIBUTING.md
@@ -137,13 +137,15 @@ python utils/modular_model_converter.py <model_name>

This will generate the separate files (`modeling_*.py`, `configuration_*.py`, etc.) from your modular file. The CI will enforce that these generated files match your modular file.

☐ **2. Add a fast image processor (for image models)**
☐ **2. Add image processors (for image models)**

If your model processes images, implement a fast image processor that uses `torch` and `torchvision` instead of PIL/numpy for better inference performance:
If your model processes images, implement both a torchvision-backed processor (the default, GPU-accelerated) and a PIL-backed processor (the alternative):

- See the detailed guide in [#36978](https://github.com/huggingface/transformers/issues/36978)
- Fast processors inherit from `BaseImageProcessorFast`
- Examples: `LlavaOnevisionImageProcessorFast`, `Idefics2ImageProcessorFast`
- The torchvision backend processor (`<Model>ImageProcessor`) inherits from `TorchvisionBackend` and lives in `image_processing_<model>.py`
- The PIL backend processor (`<Model>ImageProcessorPil`) inherits from `PilBackend` and lives in `image_processing_pil_<model>.py`
- Both are imported from `image_processing_backends`; the PIL kwargs class is defined in the torchvision file and imported by the PIL file
- See the detailed guide in [IMAGE_PROCESSOR_REFACTORING_GUIDE.md](https://github.com/huggingface/transformers/blob/main/IMAGE_PROCESSOR_REFACTORING_GUIDE.md)
- Examples: `CLIPImageProcessor` / `CLIPImageProcessorPil`, `DonutImageProcessor` / `DonutImageProcessorPil`
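The two-file split described above can be sketched as follows. This is a hedged, self-contained sketch: the backend base classes are stubbed locally so the snippet runs standalone (in the library they would be imported from `image_processing_backends`), and the `MyModel*` names are hypothetical placeholders, not real transformers classes.

```python
# Stubs standing in for the real backend base classes from
# image_processing_backends -- assumption, not the actual library code.
class TorchvisionBackend:
    backend = "torchvision"

class PilBackend:
    backend = "pil"

# --- image_processing_mymodel.py: the default, torchvision-backed processor ---
class MyModelImageProcessorKwargs(dict):
    """Kwargs class, defined once in the torchvision file."""

class MyModelImageProcessor(TorchvisionBackend):
    valid_kwargs = MyModelImageProcessorKwargs

# --- image_processing_pil_mymodel.py: the PIL/NumPy alternative ---
# (imports MyModelImageProcessorKwargs from the torchvision file)
class MyModelImageProcessorPil(PilBackend):
    valid_kwargs = MyModelImageProcessorKwargs
```

Keeping the kwargs class in the torchvision file and importing it from the PIL file ensures both backends accept exactly the same configuration.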

☐ **3. Create a weight conversion script**

@@ -225,7 +227,7 @@ Here's a condensed version maintainers can copy into PRs:
Please ensure your PR completes all following items. See the [full checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#vision-language-model-contribution-checklist) for details.

- [ ] **Modular file**: `modular_<model_name>.py` implemented and verified with `python utils/modular_model_converter.py <model_name>`
- [ ] **Fast image processor**: Implemented using `BaseImageProcessorFast` (see [#36978](https://github.com/huggingface/transformers/issues/36978))
- [ ] **Image processors**: Torchvision backend (`<Model>ImageProcessor` from `TorchvisionBackend`) and PIL backend (`<Model>ImageProcessorPil` from `PilBackend`) both implemented (see [IMAGE_PROCESSOR_REFACTORING_GUIDE.md](https://github.com/huggingface/transformers/blob/main/IMAGE_PROCESSOR_REFACTORING_GUIDE.md))
- [ ] **Conversion script**: `convert_<model_name>_to_hf.py` added with usage examples
- [ ] **Integration tests**: End-to-end tests with exact output matching (text or logits)
- [ ] **Documentation**: Model docs added/updated in `docs/source/en/model_doc/`
14 changes: 5 additions & 9 deletions docs/source/en/add_new_model.md
@@ -544,20 +544,16 @@ When both implementations have the same `input_ids`, add a tokenizer test file.
## Implement image processor

> [!TIP]
> Fast image processors use the [torchvision](https://pytorch.org/vision/stable/index.html) library and can perform image processing on the GPU, significantly improving processing speed.
> We recommend adding a fast image processor ([`BaseImageProcessorFast`]) in addition to the "slow" image processor ([`BaseImageProcessor`]) to provide users with the best performance. Feel free to tag [@yonigozlan](https://github.com/yonigozlan) for help adding a [`BaseImageProcessorFast`].
> Image processors now use a backend-based architecture. The default backend is [`TorchvisionBackend`], which uses the [torchvision](https://pytorch.org/vision/stable/index.html) library and can perform image processing on the GPU. A PIL/NumPy alternative backend ([`PilBackend`]) is also provided. Both backends are imported from `image_processing_backends`. Feel free to tag [@yonigozlan](https://github.com/yonigozlan) for help.

While this example doesn't include an image processor, you may need to implement one if your model requires image inputs. The image processor is responsible for converting images into a format suitable for your model. Before implementing a new one, check whether an existing image processor in the Transformers library can be reused, as many models share similar image processing techniques. Note that you can also use [modular](./modular_transformers) for image processors to reuse existing components.

If you do need to implement a new image processor, refer to an existing image processor to understand the expected structure. Slow image processors ([`BaseImageProcessor`]) and fast image processors ([`BaseImageProcessorFast`]) are designed differently, so make sure you follow the correct structure based on the processor type you're implementing.
If you do need to implement a new image processor, each model has two processor files:

Run the following command (only if you haven't already created the fast image processor with the `transformers add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
- `image_processing_<model>.py`: the **default** torchvision-backed processor (`<Model>ImageProcessor`), inheriting from [`TorchvisionBackend`]. This replaces the old "fast" processor.
- `image_processing_pil_<model>.py`: the PIL/NumPy alternative processor (`<Model>ImageProcessorPil`), inheriting from [`PilBackend`]. This replaces the old "slow" processor.

```bash
transformers add-fast-image-processor --model-name your_model_name
```

This command will generate the necessary imports and provide a pre-filled template for the fast image processor. You can then modify it to fit your model's needs.
The torchvision backend file also defines any custom kwargs class that the PIL file imports. Both files use the `@auto_docstring` decorator β€” do not add manual class docstrings. Refer to the [IMAGE_PROCESSOR_REFACTORING_GUIDE.md](https://github.com/huggingface/transformers/blob/main/IMAGE_PROCESSOR_REFACTORING_GUIDE.md) for a step-by-step walkthrough and complete examples.

Add tests for the image processor in `tests/models/your_model_name/test_image_processing_your_model_name.py`. These tests should be similar to those for other image processors and should verify that the image processor correctly handles image inputs. If your image processor includes unique features or processing methods, ensure you add specific tests for those as well.
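A minimal sketch of the shape such a test might take. The processor here is a local stub so the example runs standalone; it is not a real transformers class, and real tests in `tests/models/` follow the shared `ImageProcessingTestMixin` pattern instead.

```python
import numpy as np

class StubImageProcessor:
    # Illustrative stand-in for a model's image processor.
    size = {"height": 4, "width": 4}

    def __call__(self, images):
        # Pretend-resize: crop each image to the configured size,
        # then stack into a batch.
        out = [img[: self.size["height"], : self.size["width"]] for img in images]
        return {"pixel_values": np.stack(out)}

def test_output_shape():
    # The key property most image-processor tests verify: the output
    # batch has the expected (batch, height, width, channels) shape.
    proc = StubImageProcessor()
    batch = proc([np.zeros((8, 8, 3), dtype=np.float32)])["pixel_values"]
    assert batch.shape == (1, 4, 4, 3)
```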

51 changes: 30 additions & 21 deletions docs/source/en/image_processors.md
@@ -16,10 +16,10 @@ rendered properly in your Markdown viewer.

# Image processors

Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
Image processors convert images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.

- [`~BaseImageProcessor.center_crop`] to resize an image
- [`~BaseImageProcessor.normalize`] or [`~BaseImageProcessor.rescale`] pixel values
- center-crop or resize an image
- normalize or rescale pixel values

Use [`~ImageProcessingMixin.from_pretrained`] to load an image processor's configuration (image size, whether to normalize and rescale, etc.) from a vision model on the Hugging Face [Hub](https://hf.co) or a local directory. The configuration for each pretrained model is saved in a [preprocessor_config.json](https://huggingface.co/google/vit-base-patch16-224/blob/main/preprocessor_config.json) file.

@@ -44,70 +44,79 @@ This guide covers the image processor class and how to preprocess images for vis

## Image processor classes

Image processors inherit from the [`BaseImageProcessor`] class which provides the [`~BaseImageProcessor.center_crop`], [`~BaseImageProcessor.normalize`], and [`~BaseImageProcessor.rescale`] functions. There are two types of image processors.
Image processors use a backend-based architecture with two backends:

- [`BaseImageProcessor`] is a Python implementation.
- [`BaseImageProcessorFast`] is a faster [torchvision-backed](https://pytorch.org/vision/stable/index.html) version. For a batch of [torch.Tensor](https://pytorch.org/docs/stable/tensors.html) inputs, this can be up to 33x faster. [`BaseImageProcessorFast`] is not available for all vision models at the moment. Refer to a models API documentation to check if it is supported.
- [`TorchvisionBackend`] β€” the default [torchvision-backed](https://pytorch.org/vision/stable/index.html) implementation. GPU-accelerated and up to 33x faster than the PIL backend for batches of [torch.Tensor](https://pytorch.org/docs/stable/tensors.html) inputs. All models support this backend; newer models only support this one.
- [`PilBackend`] β€” the PIL/NumPy alternative. Portable and CPU-only. Only available for older models, where it is useful to reproduce the exact numerical outputs of the original implementation.

Each image processor subclasses the [`ImageProcessingMixin`] class which provides the [`~ImageProcessingMixin.from_pretrained`] and [`~ImageProcessingMixin.save_pretrained`] methods for loading and saving image processors.
The active backend on a loaded processor can be inspected with its `backend` attribute (e.g., `processor.backend == "torchvision"`). Each image processor subclasses [`ImageProcessingMixin`] which provides the [`~ImageProcessingMixin.from_pretrained`] and [`~ImageProcessingMixin.save_pretrained`] methods.

There are two ways you can load an image processor, with [`AutoImageProcessor`] or a model-specific image processor.
There are two ways you can load an image processor: with [`AutoImageProcessor`] or directly from a model-specific class.

<hfoptions id="image-processor-classes">
<hfoption id="AutoImageProcessor">

The [AutoClass](./model_doc/auto) API provides a convenient method to load an image processor without directly specifying the model the image processor is associated with.

Use [`~AutoImageProcessor.from_pretrained`] to load an image processor, and set `use_fast=True` to load a fast image processor if it's supported.
Use [`~AutoImageProcessor.from_pretrained`] with the `backend` argument to select the backend. When `backend` is omitted (the default), torchvision is picked when it is installed and PIL is used otherwise. Note that `backend="pil"` is only supported for older models; newer models only expose the torchvision backend.

> **Note:** a small set of older models (Chameleon, Flava, Idefics3, SmolVLM) use Lanczos interpolation that torchvision does not support, so they always default to the PIL backend regardless of torchvision availability. Pass `backend="torchvision"` explicitly to override this.

```py
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224", use_fast=True)
# Default: picks torchvision if available, otherwise pil
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Explicitly request the torchvision backend
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224", backend="torchvision")

# Explicitly request the PIL backend (only for models that support it)
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224", backend="pil")
```

</hfoption>
<hfoption id="model-specific image processor">

Each image processor is associated with a specific pretrained vision model, and the image processors configuration contains the models expected size and whether to normalize and resize.
Each image processor is associated with a specific pretrained vision model, and its configuration contains the model's expected size and normalization parameters.

The image processor can be loaded directly from the model-specific class. Check a models API documentation to see whether it supports a fast image processor.
Load the torchvision backend processor directly from the model-specific class.

```py
from transformers import ViTImageProcessor

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
```

To load a fast image processor, use the fast implementation class.
For models that support it, you can load the PIL backend with the `Pil`-suffixed class. This is useful when you need exact numerical parity with the original implementation.

```py
from transformers import ViTImageProcessorFast
from transformers import ViTImageProcessorPil

image_processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")
image_processor = ViTImageProcessorPil.from_pretrained("google/vit-base-patch16-224")
```

</hfoption>
</hfoptions>

## Fast image processors
## Torchvision backend processors

[`BaseImageProcessorFast`] is based on [torchvision](https://pytorch.org/vision/stable/index.html) and is significantly faster, especially when processing on a GPU. This class can be used as a drop-in replacement for [`BaseImageProcessor`] if it's available for a model because it has the same design. Make sure [torchvision](https://pytorch.org/get-started/locally/#mac-installation) is installed, and set the `use_fast` parameter to `True`.
[`TorchvisionBackend`] is the **default** backend. Make sure [torchvision](https://pytorch.org/get-started/locally/#mac-installation) is installed, then load it with `backend="torchvision"` (or simply omit `backend`, since torchvision is selected automatically when available).

```py
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", backend="torchvision")
```

Control which device processing is performed on with the `device` parameter. Processing is performed on the same device as the input by default if the inputs are tensors, otherwise they are processed on the CPU. The example below places the fast processor on a GPU.
Control which device processing is performed on with the `device` argument. Processing is performed on the same device as the input by default if the inputs are tensors, otherwise it falls back to CPU. The example below runs processing on a GPU.

```py
from torchvision.io import read_image
from transformers import DetrImageProcessorFast
from transformers import DetrImageProcessor

images = read_image("image.jpg")
processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
images_processed = processor(images, return_tensors="pt", device="cuda")
```

12 changes: 6 additions & 6 deletions docs/source/en/internal/import_utils.md
@@ -28,17 +28,17 @@ object for which you are lacking a dependency will error-out when calling any me
This object is still importable:

```python
>>> from transformers import DetrImageProcessorFast
>>> print(DetrImageProcessorFast)
<class 'DetrImageProcessorFast'>
>>> from transformers import DetrImageProcessor
>>> print(DetrImageProcessor)
<class 'DetrImageProcessor'>
```

However, no method can be called on that object:

```python
>>> DetrImageProcessorFast.from_pretrained()
ImportError:
DetrImageProcessorFast requires the Torchvision library but it was not found in your environment. Check out the instructions on the
>>> DetrImageProcessor.from_pretrained()
ImportError:
DetrImageProcessor requires the Torchvision library but it was not found in your environment. Check out the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.
```