Conversation

@zucchini-nlp (Member) commented Oct 6, 2025

What does this PR do?

Branches out from #40884 (comment) to make review and merge faster

Adds a class attribute to each model indicating its supported input and output modalities. Output modalities will be None when the model is not generative and "text" in most other cases; only a few models can generate audio or images. Note that for encoder-decoder models like Whisper, the input modalities contain both the encoder ("audio") and decoder ("text") modalities.

This will be used first for the pipeline, and we can extend usage later to a better testing suite and to preparing inputs in generation with multimodal LLMs (e.g. if we move multimodal encoding to GenerationMixin._prepare_multimodal_encodings). No tests are added at this point, because there is nothing to test yet.
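
To make the idea concrete, here is a minimal sketch (plain classes, not the actual diff) of how such attributes could look. The attribute names follow this PR's discussion; the per-model values are illustrative:

```python
# Minimal sketch, not the actual diff: per-model modality attributes.
# Attribute names follow the PR discussion; values are illustrative.
from typing import Optional, Tuple


class ModalityAttrsMixin:
    # Defaults: text in, text out
    input_modalities: Tuple[str, ...] = ("text",)
    output_modalities: Optional[Tuple[str, ...]] = ("text",)


class TextOnlyLMExample(ModalityAttrsMixin):
    pass  # inherits text -> text


class WhisperLikeExample(ModalityAttrsMixin):
    # Encoder-decoder: the encoder consumes audio, the decoder consumes text
    input_modalities = ("audio", "text")


class EncoderOnlyExample(ModalityAttrsMixin):
    output_modalities = None  # not generative, so no output modalities
```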

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Contributor) left a comment

I wonder if we can automate these variables, instead of having to manually define them. E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?

(fewer manual flags = smaller odds of human error = fewer bugs)
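
As an illustration of this suggestion, a rough sketch of signature-based inference. The argument-name-to-modality mapping below is an assumption for illustration, not an established transformers convention:

```python
# Rough sketch of inferring input modalities from a model's forward()
# signature. The name-to-modality mapping here is an assumption.
import inspect
from typing import Tuple

ARG_TO_MODALITY = {
    "input_ids": "text",
    "pixel_values": "image",
    "pixel_values_videos": "video",  # video arg names vary across models
    "input_features": "audio",
    "input_values": "audio",
}


def infer_input_modalities(model_cls) -> Tuple[str, ...]:
    """Map known forward() argument names to modalities."""
    params = inspect.signature(model_cls.forward).parameters
    found = {ARG_TO_MODALITY[p] for p in params if p in ARG_TO_MODALITY}
    return tuple(sorted(found))
```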

@zucchini-nlp (Member Author)

E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?

Yeah, I also thought of it. It is doable for most models, but there are some tricky ones as well. For example, we don't have a consistent naming convention for the video modality, and we have no way to tell what is output by a model with an overridden generate(). We could have a default for input_modalities as well, similar to output_modalities, but then manually override it for all models where the pattern does not match.
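
To illustrate that compromise, a hypothetical sketch of "infer by default, override by hand" via an `__init_subclass__` hook (the merged PR keeps the attributes fully explicit instead):

```python
# Hypothetical sketch of "infer by default, override by hand". The merged
# PR defines the attributes explicitly rather than using a hook like this.
import inspect
from typing import Tuple

_ARG_TO_MODALITY = {
    "input_ids": "text",
    "pixel_values": "image",
    "input_features": "audio",
}


class AutoModalityBase:
    input_modalities: Tuple[str, ...] = ("text",)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Infer only when the subclass did not set the attribute itself;
        # tricky models (e.g. custom generate()) override it manually.
        if "input_modalities" not in cls.__dict__ and "forward" in cls.__dict__:
            params = inspect.signature(cls.forward).parameters
            found = {_ARG_TO_MODALITY[p] for p in params if p in _ARG_TO_MODALITY}
            cls.input_modalities = tuple(sorted(found)) or ("text",)
```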

@gante (Contributor) commented Oct 6, 2025

We could have a default for input_modalities as well, similar to output_modalities, but then manually override it for all models where the pattern does not match

imo this would be an improvement :) also an incentive to nudge contributors towards standard names and definitions!

@gante (Contributor) commented Oct 6, 2025

But check with @ArthurZucker before committing code!

(See tenet number 4: standardization for model definitions, abstractions for infra. I would place this under infra)

@adarshxs commented Oct 9, 2025

I think this is a decent approach. Waiting on this PR as it helps quite a lot: rather than depending on heuristics or some sort of registry, we want to get the input/output modalities supported by a model via its config. I hope we can adopt this as a standard soon @ArthurZucker

@zucchini-nlp (Member Author)

@bot /style

@github-actions (Contributor)

Style fix is beginning... View the workflow run here.

@ArthurZucker (Collaborator) left a comment

Explicit and non-automatic looks better for now IMO 😉

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, align, altclip, aria, audio_spectrogram_transformer, autoformer, aya_vision, bark, beit, bit, blip, blip_2, blt, bridgetower, chameleon, chinese_clip

@zucchini-nlp zucchini-nlp merged commit 1c36d40 into huggingface:main Oct 16, 2025
22 checks passed
ngazagna-qc pushed a commit to ngazagna-qc/transformers that referenced this pull request Oct 23, 2025
* update all models

* fix copies

* explanation comment

* better notation in omni model

* style

* fix copies

* output_modalities under generation mixin

* fix copies

* oh, glm4v also needs conversion
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026