Skip to content

Simplify and standardize processor tests#41773

Merged
yonigozlan merged 52 commits into
huggingface:mainfrom
yonigozlan:simplify-processor-tests
Nov 26, 2025
Merged

Simplify and standardize processor tests#41773
yonigozlan merged 52 commits into
huggingface:mainfrom
yonigozlan:simplify-processor-tests

Conversation

@yonigozlan

Copy link
Copy Markdown
Contributor

What does this PR do?

Improve ProcessorTestMixin to standardize processor tests, especially the setup part.
Requires #41633

@yonigozlan yonigozlan force-pushed the simplify-processor-tests branch from 8e12561 to 1ed7c56 Compare October 22, 2025 17:55
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean-up. I didn't look at all the test files, if CI is green I'll assume they are fine :)

Comment thread src/transformers/models/auto/configuration_auto.py
Comment thread src/transformers/models/gemma3n/processing_gemma3n.py Outdated
original_sizes = original_sizes.tolist()
if isinstance(reshaped_input_sizes, (torch.Tensor, np.ndarray)):
reshaped_input_sizes = reshaped_input_sizes.tolist()
# TODO: add connected components kernel for postprocessing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like still in progress? The args max_hole_area and max_sprinkle_area are not used by fn and I think modular didn't copy self._apply_non_overlapping_constraints

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I need to add the kernels for these in another PR, however reshaped_input_sizes is never needed in any case

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo we need to do all changes in a different PR in that case, because the current changes look unfinished. Up to you though :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @zucchini-nlp, let's keep the focus of this PR and do other changes in another PR all at once instead of adding TODOs etc!

Comment thread tests/models/aya_vision/test_processing_aya_vision.py Outdated
Comment thread tests/test_processing_common.py Outdated
Comment thread tests/test_processing_common.py Outdated
"""
attributes = self.processor_class.get_attributes()

if not any(attr in ["tokenizer", "image_processor"] for attr in attributes):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can support video_processor and feature_extractor, as those are another two common attributes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for video_processor. However it looks like very few feature extractors actually use AudioKwargs, so it's difficult to set a kwargs that will work with all feature extractors safely. Maybe @eustlb you have more context to handle this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i think we didn't pay much attention to audio models when updating. Might be smth on audio-team''s roadmap

Comment thread tests/test_processing_common.py Outdated
Comment on lines +542 to +543
for key in input_image_proc:
self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to sum instead of torch.assertallclose?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it thanks!

# Process with both tokenizer and processor (disable padding to ensure same output)
try:
encoded_processor = processor(text=input_str, padding=False, return_tensors="pt")
except Exception:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we catch a specific exception here (ValueError)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting different type of error there, so hard to specify

Comment thread tests/test_processing_common.py Outdated
@yonigozlan

Copy link
Copy Markdown
Contributor Author

@Cyrilvallez This should be ready to merge!
@ydshieh The failing tests here are also failing on main:
test_overflowing_tokens is failing for test_processing_udop, test_processing_layoutlmv2 and test_processing_layoutxlm. It looks like a missing repo on the hub?

@ydshieh

ydshieh commented Nov 12, 2025

Copy link
Copy Markdown
Collaborator

Hi @yonigozlan

For LayoutLMv2ProcessorTest, after the merged PR

[v5] 🚨Refactor subprocessors handling in processors (#41633)
the error is

commit: f065e40

FAILED tests/models/layoutlmv2/test_processing_layoutlmv2.py::LayoutLMv2ProcessorTest::test_overflowing_tokens - OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

Prior, it was

commit: 91d250e

FAILED tests/models/layoutlmv2/test_processing_layoutlmv2.py::LayoutLMv2ProcessorTest::test_overflowing_tokens - ValueError: Words must be of type `List[str]` (single pretokenized example), or `List[List[str]]` (batch of pretokenized examples).

Could you check this?

I will check what I can do once the error being back to the same error. They are failing for quite some time 😢

@Cyrilvallez Cyrilvallez left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup! Left some comments but as I'm not super familiar with processor tests, feel free to disregard if out-of-scope. If @zucchini-nlp is onboard, then feel free to merge!

original_sizes = original_sizes.tolist()
if isinstance(reshaped_input_sizes, (torch.Tensor, np.ndarray)):
reshaped_input_sizes = reshaped_input_sizes.tolist()
# TODO: add connected components kernel for postprocessing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @zucchini-nlp, let's keep the focus of this PR and do other changes in another PR all at once instead of adding TODOs etc!

Comment on lines -132 to +346
components = self.prepare_components()
processor = self.processor_class(**components, **self.prepare_processor_dict())
processor = self.processor_class.from_pretrained(self.tmpdirname)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that assuming that saving/loading works perfectly? I.e. is it not clashing a bit with the test philosophy as we also test that saving and loading are behaving as they shiould

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes i see what you mean. We have several tests that test if we get the same processor after saving and reloading with save and from_pretrained. However we don't have a test to check that loading a processor with from pretrained is equivalent to loading each sub-components separately with from pretrained and instantiating the processor with these loaded sub components after, so I just added one (test_processor_from_pretrained_vs_from_components).

Comment on lines +135 to +139
@classmethod
def _setup_test_attributes(cls, processor):
# to override in the child class to define class attributes
# such as image_token, video_token, audio_token, etc.
pass

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would maybe be a bit clearer/easier to do in __init__?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to have this as a separate method, as it will be called by the setUpClass which can pass it a processor object. It might be weird to have init not be the entrypoint of the object

@yonigozlan

Copy link
Copy Markdown
Contributor Author

@zucchini-nlp @Cyrilvallez Thanks for the reviews! Waiting on your approval to merge now :)

One thing to note, I had to add a config for layoutxlm to have all tests pass. This hides a deeper issues, basically we have a few models in the library that don't have a config (namely code_llama, gpt-sw3, granitevision and nougat), but this messes up completely the "SUBPROCESSOR"_MAPPING that are built as _LazyAutoMapping objects in auto_factory.py. _LazyAutoMapping defines a _reverse_config_mapping dict, but if a config is used for several models in CONFIG_MAPPING_NAMES in configuration_auto.py, all but one of the mapping will be overwritten.

Imo the solution would be to always have a config when adding a model, especially now with modular, no reason not to have that.

I'll open a PR to enforce that, unless you think there is another way to handle the issue.

@zucchini-nlp zucchini-nlp left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Layout config addition looks fine to me. I wonder how we ended up using the same config with two models given the "one model - one file" philosophy. If get special model arch which can't map correctly with config mapping, a custom setup_{attribute} is another way imo

Haven't looked at all model tests, the common tests look good though. Thanks!

Comment on lines -1353 to -1354
rope_deltas (`torch.LongTensor` of shape `(batch_size, )`, *optional*):
The rope index difference between sequence length and multimodal rope.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks, the repo check complains on the docs order. Just noticed that the arg isn't even in the signature 😆

vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the LayoutLMv2 model. Defines the number of different tokens that can be represented by
the `inputs_ids` passed when calling [`LayoutLMv2Model`] or [`TFLayoutLMv2Model`].
Vocabulary size of the LayoutXLM model. Defines the number of different tokens that can be represented by

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes arent' needed, no? The config is called LayoutLMv2Config

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes my bad, thanks for catching that

Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the LayoutXLM model. Defines the number of different tokens that can be represented by
the `inputs_ids` passed when calling [`LayoutXLMModel`] or [`TFLayoutXLMModel`].

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not super related to the PR, looks like there are a few references to a non-existing TF class in the docs

Comment on lines +475 to +480
has_image_processor = "image_processor" in attributes
if has_image_processor:
additional_kwargs["do_normalize"] = False
has_video_processor = "video_processor" in attributes
if has_video_processor:
additional_kwargs["do_normalize"] = False

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an interesting point I have been thinking recently, especially with the use_fast flag. We have no way for users to override kwargs from one attribute only. The kwargs are passed to all processor's attributes

Maybe we can consider after v5 making it separate?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed that could be useful. I guess users can still load the subcomponents separately with from_pretrained and different use_fast flags, but not ideal indeed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I just remembered one person was trying to load a processor with slow image processor and fast tokenizer, because the model has no support for a slow tokenizer

@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, edgetam, gemma3n, glm46v, glm4v, glm4v_moe, layoutlmv2, layoutxlm, llava_next, llava_onevision, mgp_str, owlvit, sam, sam2, sam2_video, sam3

@yonigozlan yonigozlan enabled auto-merge (squash) November 26, 2025 17:40
@yonigozlan yonigozlan merged commit 06d52fe into huggingface:main Nov 26, 2025
23 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* remove attributes and add all missing sub processors to their auto classes

* remove all mentions of .attributes

* cleanup

* fix processor tests

* fix modular

* remove last attributes

* fixup

* fixes after merge

* fix wrong tokenizer in auto florence2

* fix missing audio_processor + nits

* Override __init__ in NewProcessor and change hf-internal-testing-repo (temporarily)

* fix auto tokenizer test

* add init to markup_lm

* update CustomProcessor in custom_processing

* remove print

* nit

* refactor processor tests first part

* refactor part 2

* fix test modeling owlv2

* fix test_processing_layoutxlm

* Fix owlv2, wav2vec2, markuplm, voxtral issues

* part3

* refactor all processor with mixin

* add support for loading and saving multiple tokenizer natively

* remove exclude_attributes from save_pretrained

* get processor from pretrained instead of components in tests

* skip tests in colqwen2, pixtral

* modifs after review

* fix style and copies

* Fix after review

* add test_processor_from_pretrained_vs_from_components, fix failing tests

* fix overflowing_tokens tests

* add config for layoutxlm

* fix ci

* use modular

* fic docstring

* Standardize mgp_str tests

* fix after review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants