[v5] 🚨Refactor subprocessors handling in processors#41633
Conversation
| "AutoFeatureExtractor": "FeatureExtractionMixin", | ||
| "AutoImageProcessor": "ImageProcessingMixin", | ||
| "AutoVideoProcessor": "BaseVideoProcessor", | ||
| "audio_tokenizer": "DacModel", |
There was a problem hiding this comment.
We should be able to use AutoModelForAudioTokenization, no?
transformers/src/transformers/models/auto/modeling_auto.py
Lines 2248 to 2249 in e20df45
We want to standardize this in the future for other models as well
zucchini-nlp
left a comment
There was a problem hiding this comment.
Nice, the attributes class attribute was indeed a bit redundant. LGTM though you might need to rebase main and check if tests pass. I just merged a non-legacy saving PR
| ), | ||
| ), | ||
| ("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), | ||
| ("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), |
There was a problem hiding this comment.
not related to this PR, but using PreTrainedTokenizer as auto-class looks funny 😄
There was a problem hiding this comment.
Yes not sure if that should be the case @itazap is that expected/is it a potential issue?
There was a problem hiding this comment.
Was wondering the same thing 👀
There was a problem hiding this comment.
Indeed, why is it added here? I see smollm3 only has the Fast one
| ("video_llava", "VideoLlavaVideoProcessor"), | ||
| ("videomae", "VideoMAEVideoProcessor"), | ||
| ("vjepa2", "VJEPA2VideoProcessor"), | ||
| ("video_llama_3", "VideoLlama3VideoProcessor"), # PLACEHOLDER - needs proper video processor class |
There was a problem hiding this comment.
There was a problem hiding this comment.
Yes mb, remnants from the script...
| """ | ||
|
|
||
| attributes = ["image_processor", "tokenizer"] | ||
| valid_kwargs = ["chat_template", "num_image_tokens"] |
There was a problem hiding this comment.
oh, let's delete valid_kwargs wherever it was left, it's not used anywhere iirc. Can totally do in a separate PR
There was a problem hiding this comment.
Nice, might as well add it here
| attributes = ["image_processor", "tokenizer", "qformer_tokenizer"] | ||
| image_processor_class = ("BlipImageProcessor", "BlipImageProcessorFast") | ||
| tokenizer_class = "AutoTokenizer" | ||
| qformer_tokenizer_class = "AutoTokenizer" |
There was a problem hiding this comment.
do we need qformer_tokenizer_class? In the InstrutcBlipVideo I can see it is deleted
There was a problem hiding this comment.
Indeed we can delete it!
|
|
||
| attributes = ["image_processor", "char_tokenizer"] | ||
| image_processor_class = ("ViTImageProcessor", "ViTImageProcessorFast") | ||
| char_tokenizer_class = "MgpstrTokenizer" |
There was a problem hiding this comment.
same question, is it because they have a prefix before tokenizer?
There was a problem hiding this comment.
yep my scripts failed on these 😅. Thanks for pointing it out!
| image_processor = EfficientNetImageProcessor.from_pretrained(self.tmpdirname) | ||
| image_processor.save_pretrained(self.tmpdirname) | ||
| tokenizer = BertTokenizer.from_pretrained(self.tmpdirname) | ||
| tokenizer.save_pretrained(self.tmpdirname) |
There was a problem hiding this comment.
i think this might cause some issues in tests after I merged non_legacy processor saving. We'll end up with preprocessor_config.json and processor_config.json in the same dir.
I remember some tests try to manipulate configs by loading-saving only the image processor as standalone or as part of processor, they might start failing after rebase
There was a problem hiding this comment.
Ah yes, in general I think we can standardize/simplify the processor tests a lot more in ProcessorTesterMixin. Right now it's a bit of a nightmare every time we want to make a change (I was going crazy on the default to fast image procesors PR because of this). I plan to open a PR to change the tests soon!
There was a problem hiding this comment.
Thanks, updating the tests would help a lot, and prob we could delete many oevrwritten tests as well
|
Do you want me to trigger a full CI for this PR? Better to have it , @zucchini-nlp can confirm it's helpful 😅 . Let me know once it's ready (and you want me to trigger) |
|
I think you can now @ydshieh , thank you! |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Didn't see this at first. Looking at the test and remote code, I don't think that will be very breaking. Haven't seen any model in the hub without an |
CircleCI still some failing jobs, so I will wait until it's ✅ here (ping me when it's ✅ 🙏 ) |
|
Ah sorry about that @ydshieh , it should be good now! |
|
ok, i will trigger, but only share the reports on Monday. |
Sounds good thanks! |
[v5] 🚨Refactor subprocessors handling in processors #41633 dummy
molbap
left a comment
There was a problem hiding this comment.
That will be a nice cleanup 😁 left a few comments!
| ), | ||
| ), | ||
| ("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), | ||
| ("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), |
There was a problem hiding this comment.
Was wondering the same thing 👀
[v5] 🚨Refactor subprocessors handling in processors #41633 dummy
…m/yonigozlan/transformers into remove-attributes-from-processors
Cyrilvallez
left a comment
There was a problem hiding this comment.
Very nice, happy to remove all those unncessary attributes!! Just a few comments, feel free to merge once you take them into account!
| ), | ||
| ), | ||
| ("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), | ||
| ("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), |
There was a problem hiding this comment.
Indeed, why is it added here? I see smollm3 only has the Fast one
|
[For maintainers] Suggested jobs to run (before merge) run-slow: align, altclip, aria, auto, aya_vision, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp |
* remove attributes and add all missing sub processors to their auto classes * remove all mentions of .attributes * cleanup * fix processor tests * fix modular * remove last attributes * fixup * fixes after merge * fix wrong tokenizer in auto florence2 * fix missing audio_processor + nits * Override __init__ in NewProcessor and change hf-internal-testing-repo (temporarily) * fix auto tokenizer test * add init to markup_lm * update CustomProcessor in custom_processing * remove print * nit * fix test modeling owlv2 * fix test_processing_layoutxlm * Fix owlv2, wav2vec2, markuplm, voxtral issues * add support for loading and saving multiple tokenizer natively * remove exclude_attributes from save_pretrained * modifs after review
* remove attributes and add all missing sub processors to their auto classes * remove all mentions of .attributes * cleanup * fix processor tests * fix modular * remove last attributes * fixup * fixes after merge * fix wrong tokenizer in auto florence2 * fix missing audio_processor + nits * Override __init__ in NewProcessor and change hf-internal-testing-repo (temporarily) * fix auto tokenizer test * add init to markup_lm * update CustomProcessor in custom_processing * remove print * nit * fix test modeling owlv2 * fix test_processing_layoutxlm * Fix owlv2, wav2vec2, markuplm, voxtral issues * add support for loading and saving multiple tokenizer natively * remove exclude_attributes from save_pretrained * modifs after review
What does this PR do?
Refactor the handling of subprocessors in processors.
attributesattribute in processors, along with all"subprocessor"_classattributesThis PR is a requirement for #41388, as otherwise we'd have to manually check that all
image_processor_classattributes are set to "AutoImageProcessor"Cc @ArthurZucker @Cyrilvallez @zucchini-nlp @molbap (and also @ydshieh as this might break some parts of the CI 👀, although I checked that all processor tests are passing still, except kosmos2.5 but that's because of a PIL.UnidentifiedImageError ;)).
Update: I'm seeing some tests breaking in
test_processor_auto.py, related to registering custom processors and subprocessors in transformers. How used is this and can we break it slightly for v5? 👀Update 2: It looks like it's not really a problem. The only edge case that will break is if a custom processor was defined by inheriting from
ProcessorMixin, without overriding__init__. ProcessorMixin used to have "feature_extractor" and "tokenizer" attributes by default, now it doesn't (which makes more sense imo)Fixed the tests by modifying the custom processor on the hub to add an init.
🚨Breaking change:
For example: