
🚨🚨 Refactor Image Processors to support different backends #43514

Merged
yonigozlan merged 59 commits into huggingface:main from yonigozlan:refactor-improc-backends
Mar 19, 2026
Conversation

@yonigozlan (Member) commented Jan 27, 2026

Image Processor Backend Refactor

Summary

Replaces the dual-file BaseImageProcessor (slow/PIL) + BaseImageProcessorFast (fast/torchvision) design with a unified backend architecture. The image_processing_utils_fast module is removed; all logic lives in image_processing_utils and image_processing_backends.


New Structure

Base classes: BaseImageProcessor in image_processing_utils defines the shared preprocessing pipeline (kwargs validation, input preparation, dispatching to backends). The built-in backend classes live in a separate file, image_processing_backends.py:

  • TorchvisionBackend: GPU-accelerated, batched operations on torch.Tensor, channels-first
  • PilBackend: Portable CPU-only, operations on np.ndarray, channels-first

Each backend implements process_image (convert raw input to backend format) and _preprocess (batch operations). Model-specific processors inherit from one of these backends.

File layout: per model, image_processing_<model>.py holds the torchvision backend (the default), and image_processing_pil_<model>.py holds the PIL backend when both exist. The no-suffix class is now the torchvision one (the opposite of the old *Fast convention).

Shared pipeline: Both backends use the same preprocess flow: validate kwargs → standardize (size, crop_size, pad_size, resample) → prepare inputs via process_image → run _preprocess. Torchvision batches by shape for efficiency; PIL processes images one by one.
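The dispatch described above can be sketched in a few lines of plain Python. MiniBackend and MiniPilBackend are illustrative stand-ins, not the actual transformers classes; the real pipeline does far more (SizeDict handling, tensor conversion, shape batching):

```python
class MiniBackend:
    """Stand-in for the shared preprocess pipeline: validate kwargs,
    prepare inputs via process_image, then dispatch to _preprocess."""

    _valid_kwargs = {"size", "resample"}

    def process_image(self, image):
        # Convert raw input to the backend-native format
        raise NotImplementedError

    def _preprocess(self, images, **kwargs):
        # Backend-specific batch operations
        raise NotImplementedError

    def preprocess(self, images, **kwargs):
        unknown = set(kwargs) - self._valid_kwargs
        if unknown:
            raise ValueError(f"Unknown kwargs: {sorted(unknown)}")
        prepared = [self.process_image(img) for img in images]
        return self._preprocess(prepared, **kwargs)


class MiniPilBackend(MiniBackend):
    def process_image(self, image):
        return list(image)  # stand-in for "convert to np.ndarray"

    def _preprocess(self, images, size=None, resample=None):
        # Stand-in "resize": truncate each image to `size` elements
        return [img[:size] if size else img for img in images]
```

The point of the split is that model processors only override `process_image` and `_preprocess`, while the validation/standardization scaffolding stays in the base class.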


Loading Paths & Fallback Logic

AutoImageProcessor.from_pretrained: Config resolution order: image processor config → nested processor config → model config. Class resolution uses image_processor_type or auto_map["AutoImageProcessor"], with fallback from legacy feature_extractor_type / AutoFeatureExtractor.

Backend resolution: New backend parameter replaces use_fast. Resolution order: (1) deprecated use_fast → converted to backend with warning; (2) explicit backend → used as-is; (3) default: "pil" for Lanczos models (Chameleon, Flava, Idefics3, SmolVLM); otherwise "torchvision" if available, else "pil".
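The resolution order reads naturally as a small function. This is a pure-Python sketch of the described logic; resolve_backend and LANCZOS_DEFAULT_PIL are illustrative names, not the actual transformers identifiers:

```python
import warnings

# Models whose reference implementation relies on Lanczos resampling,
# which torchvision does not provide, so they default to PIL.
LANCZOS_DEFAULT_PIL = {"chameleon", "flava", "idefics3", "smolvlm"}


def resolve_backend(model_type, backend=None, use_fast=None,
                    torchvision_available=True):
    # (1) deprecated use_fast -> converted to backend with a warning
    if use_fast is not None:
        warnings.warn("`use_fast` is deprecated; use `backend` instead.")
        return "torchvision" if use_fast else "pil"
    # (2) explicit backend -> used as-is
    if backend is not None:
        return backend
    # (3) default: "pil" for Lanczos models, else torchvision if available
    if model_type in LANCZOS_DEFAULT_PIL:
        return "pil"
    return "torchvision" if torchvision_available else "pil"
```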

Mapping format: IMAGE_PROCESSOR_MAPPING_NAMES entries are now {"torchvision": "ClassName", "pil": "ClassNamePil"} dicts instead of (slow, fast) tuples. Models may expose one or both backends.

Fallback when backend unavailable: _load_class_with_fallback tries the requested backend first, then other backends in the mapping. If torchvision is requested but unavailable, falls back to PIL with a warning.


Registering New / Custom Backends

AutoImageProcessor.register(): Registers image processor classes for a given config. The preferred API is image_processor_classes={"backend_name": ProcessorClass}. You can register one or more backends per model type.

Custom backends: The backend key space is open: any string (e.g. "torchvision", "pil", "mlx", "onnx") can be used. Each processor class must inherit from BaseImageProcessor and implement process_image and _preprocess. Users select a backend via AutoImageProcessor.from_pretrained(..., backend="custom"). The same fallback logic applies: if the requested backend is unavailable (e.g. missing deps), loading tries the other backends in the mapping.

Legacy params: slow_image_processor_class and fast_image_processor_class are deprecated; they are converted to image_processor_classes={"pil": ...} and image_processor_classes={"torchvision": ...} respectively.

Partial updates: When re-registering a config that already has backends, passing image_processor_classes merges into the existing mapping (e.g. adding a new backend without overwriting existing ones).


Backward Compatibility

  • use_fast=True/False: Deprecated warning; converted to backend="torchvision" / backend="pil".
  • image_processor_type: "FooImageProcessorFast" in config: Strips Fast suffix; resolves to base class and requested backend.
  • BaseImageProcessorFast class name: Resolves to TorchvisionBackend.
  • FooImageProcessorFast via import: _LazyModule / get_image_processor_class_from_name resolves to FooImageProcessor when Fast class no longer exists.
  • from transformers import FooImageProcessor when torchvision missing: _LazyModule.__getattr__ transparently falls back to FooImageProcessorPil and warns once (import_utils).
  • auto_map: [slow, fast] list: _resolve_auto_map_class_ref supports both list and new dict format.
  • slow_image_processor_class / fast_image_processor_class in register(): Converted to new image_processor_classes={} dict form.
  • is_fast property: Deprecated; use processor.backend == "torchvision".

Other Changes

  • resample: Single parameter name; Torchvision backend maps PIL resample to InterpolationMode internally.
  • SizeDict: Used consistently in _preprocess; dict literals remain for class attribute defaults.
  • _set_attributes: Centralized in BaseImageProcessor; backends call it in __init__ to resolve kwargs and class defaults.
  • import_utils.BASE_FILE_REQUIREMENTS: Still treats image_processing*_fast.py as torchvision-backed for lazy import structure; legacy _fast filenames may remain until models are fully migrated.
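One subtlety of the resample unification, which the follow-up commits referenced below had to patch, is that a caller passing `resample=` to a method whose parameter is still named `interpolation` gets no error: the kwarg is silently swallowed by **kwargs and the default wins. A minimal illustration (function names are hypothetical):

```python
def resize_buggy(image, size, interpolation="bilinear", **kwargs):
    # Caller passes resample=..., which lands in **kwargs unnoticed,
    # so the interpolation default is used instead.
    return (size, interpolation)


def resize_fixed(image, size, resample="bilinear", **kwargs):
    # Parameter renamed to match what the pipeline actually passes.
    return (size, resample)
```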

@ArthurZucker (Collaborator) left a comment

this is going in a great direction!
I think that having the image_processor_xxxx in general calling self.resize works well, as it would fetch the backend's method.

The thing I am not seeing right now is, for example, how someone would go about adding a new ImageProcessingLlavaNext but with, say, mlx processing.

They would have to create a class that inherits from their custom mixin, and then there needs to be a way to automatically ensure that MlxImageProcessingLlavaNext is the class that gets used when requesting the mlx backend.

If we are able to take that into account we should be fairly ready!
Otherwise very nice for now!

    `bool`: Whether or not this image processor is using the fast (TorchVision) backend.
    """
-   return False
+   return self.backend == "torchvision"
Collaborator

this attribute should be removed imo. Numpy can be faster in some cases and it does not represent anything anymore

Member Author

added a deprecation cycle as I think it's used by downstream libraries

# Backend availability checkers: maps backend names to functions that check availability
_backend_availability_checks = {
    "torchvision": is_torchvision_available,
    "python": lambda: True,  # Python backend is always available
Collaborator

Suggested change:
-    "python": lambda: True,  # Python backend is always available
+    "numpy": lambda: True,  # Python backend is always available

It relies on numpy no? (just saying the name should probably be different)

Member Author

Yes, it's a bit misleading, but the vision operations are handled by PIL (with numpy arrays as inputs/outputs), so maybe naming the backend "pil" is better? Plus it makes more explicit that PIL is a required dependency for this backend.

@yonigozlan (Member Author)

Thanks @ArthurZucker !
This should already be supported for transformers contributors, and I've added a register_backend() method to make this cleaner for users who don't want to modify the transformers codebase:

  • Contributing to transformers:
# In image_processing_utils.py - create the generic MLX backend if it doesn't already exist
class MlxBackend(ImageProcessingBackend):
    def resize(self, image, size, **kwargs):
        # generic MLX resize
        pass
    # ... other generic MLX methods

# In llava_next/image_processing_llava_next.py - inherit from it
class LlavaNextMlxBackend(MlxBackend):
    def preprocess(self, images, image_grid_pinpoints, **kwargs):
        # LlavaNext-specific patch processing with MLX
        pass

class LlavaNextImageProcessor(BaseImageProcessor):
    _backend_classes = {
        "torchvision": LlavaNextTorchVisionBackend,
        "python": LlavaNextPythonBackend,
        "mlx": LlavaNextMlxBackend,
    }
    _backend_availability_checks = {
        "torchvision": is_torchvision_available,
        "python": lambda: True,
        "mlx": is_mlx_available,
    }
  • Without changing transformers codebase:
from transformers import ImageProcessingBackend, LlavaNextImageProcessor

# No need for users to add both an MLX mixin and an inherited LlavaNextMlxBackend, just overwrite the necessary method directly in LlavaNextMlxBackend
class LlavaNextMlxBackend(ImageProcessingBackend):
    def resize(self, image, size, **kwargs):
        # your MLX implementation
        pass
    # ... implement other methods

LlavaNextImageProcessor.register_backend(
    name="mlx",
    backend_class=LlavaNextMlxBackend,
    availability_check=lambda: is_mlx_available()  # optional
)

processor = LlavaNextImageProcessor(backend="mlx")

Then instantiate like this:

processor = LlavaNextImageProcessor.from_pretrained("llava-hf/llama3-llava-next-8b-hf", backend="mlx")


@requires(backends=("vision",))
@lru_cache(maxsize=10)
def validate_fast_preprocess_arguments(
Collaborator

what is the fast sense here?

Member Author

None, needs to be renamed/modified 😁

Four comment threads on src/transformers/image_processing_utils.py (three marked outdated).
"pil": MyPilBackend,
}

To add a new backend, extend both `_backend_classes` and `_backend_availability_checks`:
Collaborator

let's rather push for register?

Comment on lines +909 to +928
resample = None
image_mean = None
image_std = None
size = None
default_to_square = True
crop_size = None
do_resize = None
do_center_crop = None
do_pad = None
pad_size = None
do_rescale = None
rescale_factor = 1 / 255
do_normalize = None
do_convert_rgb = None
return_tensors = None
data_format = ChannelDimension.FIRST
input_data_format = None
device = None
model_input_names = ["pixel_values"]
image_seq_length = None
Collaborator

i really don't understand why you have these when you also have ImageKwargs? does it not defeat the point?

Comment on lines +1175 to +1177
Update kwargs that need further processing before being validated.
Can be overridden by subclasses to customize the processing of kwargs.
"""
Collaborator

this function looks very weird.... but okay

Comment on lines +1258 to +1261
# Extract parameters that are only used for preparing the input images
do_convert_rgb = kwargs.pop("do_convert_rgb")
input_data_format = kwargs.pop("input_data_format")
device = kwargs.pop("device")
Collaborator

this is weird as well, I don't get why they can't fall through with the rest normally

"""
Preprocess an image or a batch of images.
"""
validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_kwargs_names)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have so many validation steps? validate kwargs, which are typed dicts, then validate the typed dict, then set defaults, then further processing, then validate the processed kwargs.
It "looks" mega bloated

@ArthurZucker (Collaborator) left a comment

much better / simpler imo!

return BatchFeature(data={"pixel_values": processed_images}, tensor_type=return_tensors)


class PilBackend(BaseImageProcessor):
Collaborator

not a super strong opinion but I would probably split into different files!

For processors that only need standard operations (resize, center crop, rescale, normalize), define class
attributes:

class MyImageProcessor(BaseImageProcessor):
Collaborator

Suggested change:
-class MyImageProcessor(BaseImageProcessor):
+class MyImageProcessor(PilBackend):

IDK I might be wrong!

Member Author

Yep sorry the docstrings were out of date!

Comment on lines +143 to +147
class MyImageProcessor(BaseImageProcessor):
    _backend_classes = {
        "torchvision": MyTorchVisionBackend,
        "pil": MyPilBackend,
    }
Collaborator

this is not valid anymore but you probably did not have the time to update it

Member Author

updated now ;)

Comment on lines +356 to +369
validate_typed_dict(self.valid_kwargs, kwargs)

# Set default kwargs from self
for kwarg_name in self._valid_kwargs_names:
    kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))

# Update kwargs that need further processing before being validated
kwargs = self._standardize_kwargs(**kwargs)

# Validate kwargs
print("kwargs: ", kwargs)
self._validate_preprocess_kwargs(**kwargs)

return self._preprocess_image_like_inputs(images, *args, **kwargs)
Collaborator

still the same comment, but it's fine to address later / it looks a bit simpler!

Comment on lines +709 to +711
if isinstance(image_processor_mapping, (list, tuple)):
    pil_class, torchvision_class = image_processor_mapping
    image_processor_mapping = {"pil": pil_class, "torchvision": torchvision_class}
Collaborator

not 100% sure when would that happen?

Collaborator

maybe if we update register to support tuple (code that would already be there) then we won't need this?

Collaborator

here I don't get it, type(config) exists, why do we create image_processor_mapping? when it should already be correct?

do_reduce_labels: bool = False,
**kwargs,
) -> None:
resample = PILImageResampling.BICUBIC
Collaborator

I am seeing TorchVisionBackend but then PILImageResampling with PIL, weird but I guess it's just an enum

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, beit, bit, blip, bridgetower, chameleon, chinese_clip

@yonigozlan yonigozlan enabled auto-merge March 19, 2026 14:14
@yonigozlan yonigozlan added this pull request to the merge queue Mar 19, 2026
Merged via the queue into huggingface:main with commit 8843333 Mar 19, 2026
28 checks passed
@yonigozlan yonigozlan deleted the refactor-improc-backends branch March 19, 2026 14:47
he-yufeng added a commit to he-yufeng/transformers that referenced this pull request Mar 20, 2026
The elif branch for URL detection (is_remote_url + download_url) was
accidentally removed in huggingface#43514 during the image processor refactor.
This restores URL support with a local download_url helper using httpx,
since the old utils.hub.download_url was intentionally dropped in v5.

Fixes huggingface#44821
ydshieh added a commit that referenced this pull request Mar 23, 2026
* fix

* check

* revert

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
ydshieh added a commit that referenced this pull request Apr 6, 2026
…age processor backend refactor

The PR #43514 refactored _preprocess to pass resample=resample to resize,
but resize still accepted interpolation as its parameter. The resample kwarg
was silently swallowed by **kwargs, causing interpolation to default to BILINEAR
instead of the intended LANCZOS->BICUBIC path, producing ~0.36 difference in pixel_values.

Fix by renaming the parameter to resample and converting PIL resample integers to
torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the
pattern used in TorchvisionBackend.resize.
ydshieh added a commit that referenced this pull request Apr 6, 2026
…r backend refactor (#45258)

* Fix SmolVLM video processor resize using wrong interpolation after image processor backend refactor

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
louzongzhi pushed a commit to louzongzhi/transformers that referenced this pull request Apr 7, 2026
…r backend refactor (huggingface#45258)

* Fix SmolVLM video processor resize using wrong interpolation after image processor backend refactor

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
…r backend refactor (huggingface#45258)

* Fix SmolVLM video processor resize using wrong interpolation after image processor backend refactor

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>