Add in-out modalities as class attribute per model #41366
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
gante left a comment
I wonder if we can automate these variables, instead of having to manually define them. E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?
(fewer manual flags = smaller odds of human error = fewer bugs)
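For illustration, a minimal sketch of what such signature-based inference could look like; the argument-name mapping below is an assumption for the example, not an established convention in the library:

```python
import inspect

# Hypothetical mapping from common `forward` argument names to modalities
# (assumed names for this sketch only).
ARG_TO_MODALITY = {
    "input_ids": "text",
    "pixel_values": "image",
    "pixel_values_videos": "video",
    "input_features": "audio",
    "input_values": "audio",
}

def infer_input_modalities(model_class):
    """Guess a model's input modalities from its `forward` signature."""
    params = inspect.signature(model_class.forward).parameters
    return sorted({m for arg, m in ARG_TO_MODALITY.items() if arg in params})
```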
Yeah, I also thought of it. It is doable for most models, but there are some tricky ones as well. For example, we don't have a consistent naming convention for the video modality, and we have no way to say what is being output by a model that has an overwritten
IMO this would be an improvement :) It's also an incentive to nudge contributors towards standard names and definitions!
But check with @ArthurZucker before committing code! (See tenet number 4: standardization for model definitions, abstractions for infra. I would place this under infra.)
I think this is a decent approach. I'm waiting on this PR as it helps quite a lot: rather than depending on heuristics or some sort of registry, we want to get the input-output modalities supported by the model via the config. I hope we can adopt this as a standard soon @ArthurZucker
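As a rough illustration of that downstream usage (the attribute names follow the PR description and may differ from the final implementation):

```python
def supports_image_to_text(model) -> bool:
    # Hypothetical pipeline-side check: read the declared modalities from the
    # model instead of maintaining a per-model registry or heuristics.
    inputs = getattr(model, "input_modalities", None) or ()
    outputs = getattr(model, "output_modalities", None) or ()
    return "image" in inputs and "text" in outputs
```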
@bot /style
Style fix is beginning... View the workflow run here.
ArthurZucker left a comment
Explicit and non-automatic looks better for now IMO 😉
[For maintainers] Suggested jobs to run (before merge): run-slow: aimv2, align, altclip, aria, audio_spectrogram_transformer, autoformer, aya_vision, bark, beit, bit, blip, blip_2, blt, bridgetower, chameleon, chinese_clip
* update all models
* fix copies
* explanation comment
* better notation in omni model
* style
* fix copies
* output_modalities under generation mixin
* fix copies
* oh, glm4v also needs conversion
What does this PR do?
Branches out from #40884 (comment) to make review and merge faster.
Adds a class attribute for each model to indicate the supported input and output modalities. Output modalities will be `None` in case the model is not generative, and "text" in most other cases; only a few models can generate audio or images in the output. Note that for encoder-decoder models like Whisper, the input modalities will contain both the encoder ("audio") and the decoder ("text") modalities.

This will be used first for the pipeline, and we can extend its usage later to the testing suite and to better input preparation in generation with multimodal LLMs (e.g. if we move multimodal encoding to `GenerationMixin._prepare_multimodal_encodings`). No test is added at this point, because there is nothing to test.
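A hypothetical sketch of how these class attributes could look on a few models (class names, attribute names, and value types are illustrative, based on the description above, not the exact implementation):

```python
class TextEncoderModel:
    # Non-generative model: no output modalities
    input_modalities = ["text"]
    output_modalities = None

class SpeechToTextModel:
    # Encoder-decoder model such as Whisper: inputs cover both the encoder
    # ("audio") and the decoder ("text"); generation produces text
    input_modalities = ["audio", "text"]
    output_modalities = ["text"]

class OmniModel:
    # One of the few models that can also generate audio in addition to text
    input_modalities = ["text", "image", "audio"]
    output_modalities = ["text", "audio"]
```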