Merged
40 commits
25f8152
initial commit
zucchini-nlp Sep 15, 2025
8f71e71
fix tests
zucchini-nlp Sep 15, 2025
beb8517
fix copies, tests and rename pipe
zucchini-nlp Sep 15, 2025
06579e5
another rename
zucchini-nlp Sep 15, 2025
0388c2e
fix copies again
zucchini-nlp Sep 15, 2025
f2ef098
activate pipeline mixin in some models
zucchini-nlp Sep 15, 2025
4c5fbb5
audio loading
zucchini-nlp Sep 15, 2025
cb18262
typo
zucchini-nlp Sep 15, 2025
08873fb
fix the test
zucchini-nlp Sep 15, 2025
06dddc1
stupid typo in filename
zucchini-nlp Sep 15, 2025
338baa0
fix copies
zucchini-nlp Sep 15, 2025
7a5e080
docs
zucchini-nlp Sep 15, 2025
076105d
forgot
zucchini-nlp Sep 15, 2025
5670450
fix pipe tests
zucchini-nlp Sep 16, 2025
8665854
fix copies
zucchini-nlp Sep 16, 2025
aafaae4
fix test
zucchini-nlp Sep 16, 2025
4252a4c
lets not pass it explicitly
zucchini-nlp Sep 16, 2025
f47372a
final fix
zucchini-nlp Sep 16, 2025
51057b3
rename in test files as well
zucchini-nlp Sep 16, 2025
ee1251d
fix again after reordering...
zucchini-nlp Sep 16, 2025
073b782
Merge branch 'main' into auto-multimodal
zucchini-nlp Sep 16, 2025
d9e74e7
add qwen2 audio
zucchini-nlp Sep 17, 2025
654db8f
Merge branch 'main' into auto-multimodal
zucchini-nlp Sep 24, 2025
61428ad
add qwen3-omni
zucchini-nlp Sep 24, 2025
e3b2318
wait, I didn't push it last time?
zucchini-nlp Sep 26, 2025
9c2404f
it's only torch from now on
zucchini-nlp Sep 26, 2025
067061b
how was the model merged with docstring issues?
zucchini-nlp Sep 26, 2025
79f9275
merge main
zucchini-nlp Oct 16, 2025
51fafd3
make style
zucchini-nlp Oct 16, 2025
5855a4a
Merge remote-tracking branch 'upstream/main' into auto-multimodal
zucchini-nlp Oct 16, 2025
cfd8d1b
requires backend depends on input modalities
zucchini-nlp Oct 16, 2025
67f8022
add repr
zucchini-nlp Oct 16, 2025
271ebd1
Merge branch 'main' into auto-multimodal
zucchini-nlp Nov 6, 2025
df7e556
Merge branch 'main' into auto-multimodal
zucchini-nlp Nov 13, 2025
28fd203
fix copies
zucchini-nlp Nov 13, 2025
8101f3f
merge main
zucchini-nlp Nov 17, 2025
27c15e4
fox copies, new models were added
zucchini-nlp Nov 17, 2025
40c96a3
merge main
zucchini-nlp Nov 24, 2025
e344128
and now fix copies
zucchini-nlp Nov 24, 2025
d4da21e
Merge branch 'main' into auto-multimodal
zucchini-nlp Nov 26, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -302,6 +302,8 @@
title: Image tasks with IDEFICS
- local: tasks/image_text_to_text
title: Image-text-to-text
- local: tasks/any_to_any
title: Any-to-any
- local: tasks/video_text_to_text
title: Video-text-to-text
- local: tasks/visual_document_retrieval
6 changes: 6 additions & 0 deletions docs/source/en/main_classes/pipelines.md
@@ -485,6 +485,12 @@ Pipelines available for multimodal tasks include the following.
- __call__
- all

### AnyToAnyPipeline

[[autodoc]] AnyToAnyPipeline
- __call__
- all

### MaskGenerationPipeline

[[autodoc]] MaskGenerationPipeline
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/auto.md
@@ -241,6 +241,10 @@ The following auto classes are available for the following audio tasks.

The following auto classes are available for the following multimodal tasks.

### AutoModelForMultimodalLM

[[autodoc]] AutoModelForMultimodalLM

### AutoModelForTableQuestionAnswering

[[autodoc]] AutoModelForTableQuestionAnswering
134 changes: 134 additions & 0 deletions docs/source/en/tasks/any_to_any.md
@@ -0,0 +1,134 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Multimodal Generation

[[open-in-colab]]

Multimodal (any-to-any) models are language models capable of processing diverse types of input data (e.g., text, images, audio, or video) and generating outputs in any of these modalities. Unlike traditional unimodal or fixed-modality models, they allow flexible combinations of input and output, enabling a single system to handle a wide range of tasks: from text-to-image generation to audio-to-text transcription, image captioning, video understanding, and so on. This task shares many similarities with image-text-to-text, but supports a wider range of input and output modalities.

In this guide, we provide a brief overview of any-to-any models and show how to use them with Transformers for inference. Unlike Vision LLMs, which are typically limited to vision-and-language tasks, omni-modal models can accept any combination of modalities (e.g., text, images, audio, video) as input, and generate outputs in different modalities, such as text or images.

Let’s begin by installing dependencies:

```bash
pip install -q transformers accelerate flash_attn
```

Let's initialize the model and the processor.

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM, infer_device
import torch

device = torch.device(infer_device())
model = AutoModelForMultimodalLM.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")
```
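
If `flash_attn` isn't available in your environment, the same checkpoint should also load with PyTorch's built-in SDPA attention. A minimal sketch, keeping everything else from the snippet above unchanged:

```python
model = AutoModelForMultimodalLM.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",  # default PyTorch scaled-dot-product attention, no extra dependency
).to(device)
```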

These models typically include a [chat template](./chat_templating) to structure conversations across modalities. Inputs can mix images, text, audio, or other supported formats in a single turn. Outputs may also vary (e.g., text generation or audio generation), depending on the configuration.

Below is an example providing a "text + audio" input and requesting a text response.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/raushan-testing-hf/audio-test/resolve/main/f2641_0_throatclearing.wav"},
            {"type": "text", "text": "What do you hear in this audio?"},
        ],
    },
]
```

We will now call the processor's [`~ProcessorMixin.apply_chat_template`] method to format the conversation and preprocess the audio inputs.

```python
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device, dtype=model.dtype)  # match the model's device and dtype
```
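
The call above returns a dict-like batch holding the tokenized text together with the extracted audio features. A quick way to inspect what the processor produced; the exact feature keys depend on the checkpoint's processor, so treat the key names below as an example rather than a guarantee:

```python
# With return_tensors="pt" every value should be a tensor; print each name and shape.
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
# e.g. input_ids, attention_mask, plus audio feature tensors such as input_features
```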

We can now pass the preprocessed inputs to the model.

```python
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```
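
Note that [`~GenerationMixin.generate`] returns the prompt tokens followed by the newly generated ones, so the decoded strings above still contain the input text. A minimal sketch to keep only the new tokens, assuming the batched `inputs` from the previous step:

```python
trimmed_ids = [
    output_ids[len(input_ids):]  # drop the prompt tokens from each sequence
    for input_ids, output_ids in zip(inputs["input_ids"], generated_ids)
]
print(processor.batch_decode(trimmed_ids, skip_special_tokens=True))
```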

## Pipeline

The fastest way to get started is to use the [`Pipeline`] API. Specify the `"any-to-any"` task and the model you want to use.

```python
from transformers import pipeline
pipe = pipeline("any-to-any", model="mistralai/Voxtral-Mini-3B-2507")
```

The example below uses a chat template to format the text input and passes audio as the multimodal data.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/raushan-testing-hf/audio-test/resolve/main/glass-breaking-151256.mp3",
            },
            {"type": "text", "text": "What do you hear in this audio?"},
        ],
    },
]
```

Pass the chat-formatted messages to [`Pipeline`] and set `return_full_text=False` to remove the input prompt from the generated output.

```python
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
outputs[0]["generated_text"]
```
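
Like other generation pipelines, it should also be possible to pass several conversations in a single call. A hedged sketch, assuming the pipeline follows the usual batching convention of returning one result per conversation:

```python
conversations = [messages, messages]  # two independent chats, reusing the example above
batched_outputs = pipe(text=conversations, max_new_tokens=20, return_full_text=False)
print(batched_outputs)  # one entry per conversation
```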

The any-to-any pipeline also supports generating audio or images with models that can produce them. To do so, set the `generation_mode` parameter. When the input contains a video, don't forget to set `fps` to the desired sampling rate; otherwise the whole video is loaded without sampling. Here is an example:

```python
import soundfile as sf

pipe = pipeline("any-to-any", model="Qwen/Qwen2.5-Omni-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/Cooking_cake.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    },
]
output = pipe(text=messages, fps=1, load_audio_from_video=True, max_new_tokens=20, generation_mode="audio")
sf.write("generated_audio.wav", output[0]["generated_audio"], samplerate=24000)  # Qwen2.5-Omni generates speech at 24 kHz
```

2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -130,6 +130,7 @@
"loss": [],
"modelcard": ["ModelCard"],
"pipelines": [
"AnyToAnyPipeline",
"AudioClassificationPipeline",
"AutomaticSpeechRecognitionPipeline",
"CsvPipelineDataFormat",
@@ -636,6 +637,7 @@
from .optimization import get_wsd_schedule as get_wsd_schedule

# Pipelines
from .pipelines import AnyToAnyPipeline as AnyToAnyPipeline
from .pipelines import AudioClassificationPipeline as AudioClassificationPipeline
from .pipelines import AutomaticSpeechRecognitionPipeline as AutomaticSpeechRecognitionPipeline
from .pipelines import CsvPipelineDataFormat as CsvPipelineDataFormat
17 changes: 17 additions & 0 deletions src/transformers/feature_extraction_sequence_utils.py
@@ -19,6 +19,7 @@

import numpy as np

from .audio_utils import is_valid_audio, load_audio
from .feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from .utils import PaddingStrategy, TensorType, is_torch_tensor, logging, to_numpy

@@ -366,3 +367,19 @@ def _get_padding_strategies(self, padding=False, max_length=None):
)

return padding_strategy

def fetch_audio(self, audio_url_or_urls: Union[str, list[str], list[list[str]]]):
    """
    Convert a single or a list of urls into the corresponding `np.ndarray` objects.

    If a single url is passed, the return value will be a single object. If a list is passed, a list of objects is
    returned.
    """
    if isinstance(audio_url_or_urls, list):
        return [self.fetch_audio(x) for x in audio_url_or_urls]
    elif isinstance(audio_url_or_urls, str):
        return load_audio(audio_url_or_urls)
    elif is_valid_audio(audio_url_or_urls):
        return audio_url_or_urls
    else:
        raise TypeError(f"only a single or a list of entries is supported but got type={type(audio_url_or_urls)}")
2 changes: 1 addition & 1 deletion src/transformers/generation/utils.py
@@ -367,7 +367,7 @@ class GenerationMixin(ContinuousMixin):
"""

# Should be overwritten by models that can generate non-text output
output_modalities = "text"
output_modalities = ("text",)

def adjust_generation_fn(
self,
2 changes: 1 addition & 1 deletion src/transformers/models/aimv2/modeling_aimv2.py
@@ -394,7 +394,7 @@ class Aimv2PreTrainedModel(PreTrainedModel):

config: Aimv2Config
base_model_prefix = "aimv2"
input_modalities = "image"
input_modalities = ("image",)
supports_gradient_checkpointing = True
_no_split_modules = [
"Aimv2EncoderLayer",
2 changes: 1 addition & 1 deletion src/transformers/models/aimv2/modular_aimv2.py
@@ -437,7 +437,7 @@ class Aimv2PreTrainedModel(PreTrainedModel):

config: Aimv2Config
base_model_prefix = "aimv2"
input_modalities = "image"
input_modalities = ("image",)
supports_gradient_checkpointing = True
_no_split_modules = [
"Aimv2EncoderLayer",
6 changes: 3 additions & 3 deletions src/transformers/models/align/modeling_align.py
@@ -821,7 +821,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
class AlignPreTrainedModel(PreTrainedModel):
config: AlignConfig
base_model_prefix = "align"
input_modalities = ["image", "text"]
input_modalities = ("image", "text")
supports_gradient_checkpointing = True

@torch.no_grad()
@@ -853,7 +853,7 @@ def _init_weights(self, module: nn.Module):
)
class AlignTextModel(AlignPreTrainedModel):
config: AlignTextConfig
input_modalities = "text"
input_modalities = ("text",)
_no_split_modules = ["AlignTextEmbeddings"]

def __init__(self, config: AlignTextConfig, add_pooling_layer: bool = True):
@@ -974,7 +974,7 @@ def forward(
class AlignVisionModel(AlignPreTrainedModel):
config: AlignVisionConfig
main_input_name = "pixel_values"
input_modalities = "image"
input_modalities = ("image",)
supports_gradient_checkpointing = False

def __init__(self, config: AlignVisionConfig):
6 changes: 3 additions & 3 deletions src/transformers/models/altclip/modeling_altclip.py
@@ -767,7 +767,7 @@ def forward(self, pixel_values: torch.FloatTensor, interpolate_pos_encoding=Fals
class AltCLIPPreTrainedModel(PreTrainedModel):
config: AltCLIPConfig
base_model_prefix = "altclip"
input_modalities = ["image", "text"]
input_modalities = ("image", "text")
supports_gradient_checkpointing = True
_no_split_module = []

@@ -872,7 +872,7 @@ def forward(
class AltCLIPVisionModel(AltCLIPPreTrainedModel):
config: AltCLIPVisionConfig
main_input_name = "pixel_values"
input_modalities = "image"
input_modalities = ("image",)

def __init__(self, config: AltCLIPVisionConfig):
super().__init__(config)
@@ -1031,7 +1031,7 @@ def forward(

class AltCLIPTextModel(AltCLIPPreTrainedModel):
config: AltCLIPTextConfig
input_modalities = "text"
input_modalities = ("text",)

def __init__(self, config):
super().__init__(config)
2 changes: 1 addition & 1 deletion src/transformers/models/aria/modeling_aria.py
@@ -573,7 +573,7 @@ def forward(
class AriaTextPreTrainedModel(PreTrainedModel):
config: AriaTextConfig
base_model_prefix = "model"
input_modalities = ["image", "text"]
input_modalities = ("image", "text")
_no_split_modules = ["AriaTextDecoderLayer", "AriaGroupedExpertsGemm"]
supports_gradient_checkpointing = True
_skip_keys_device_placement = "past_key_values"
2 changes: 1 addition & 1 deletion src/transformers/models/aria/modular_aria.py
@@ -1184,7 +1184,7 @@ def __init__(self, config: AriaTextConfig, layer_idx: int):
class AriaTextPreTrainedModel(PreTrainedModel):
config: AriaTextConfig
base_model_prefix = "model"
input_modalities = ["image", "text"]
input_modalities = ("image", "text")
_no_split_modules = ["AriaTextDecoderLayer", "AriaGroupedExpertsGemm"]
supports_gradient_checkpointing = True
_skip_keys_device_placement = "past_key_values"
@@ -257,7 +257,7 @@ def forward(
class AudioFlamingo3PreTrainedModel(PreTrainedModel):
config: AudioFlamingo3Config
base_model_prefix = "model"
input_modalities = ["audio", "text"]
input_modalities = ("audio", "text")
supports_gradient_checkpointing = True
_no_split_modules = ["AudioFlamingo3Attention"]
_skip_keys_device_placement = "past_key_values"
25 changes: 25 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -1029,6 +1029,21 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
]
)

# Models that accept text and optionally multimodal data in inputs
# and can generate text and optionally multimodal data.
MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES = OrderedDict(
    [
        *list(MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.items()),
        ("granite_speech", "GraniteSpeechForConditionalGeneration"),
        ("kyutai_speech_to_text", "KyutaiSpeechToTextForConditionalGeneration"),
        ("phi4_multimodal", "Phi4MultimodalForCausalLM"),
        ("qwen2_5_omni", "Qwen2_5OmniForConditionalGeneration"),
        ("qwen2_audio", "Qwen2AudioForConditionalGeneration"),
        ("qwen3_omni_moe", "Qwen3OmniMoeForConditionalGeneration"),
        ("voxtral", "VoxtralForConditionalGeneration"),
    ]
)

MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
[
# Model for Masked LM mapping
@@ -1782,6 +1797,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
)
MODEL_FOR_MULTIMODAL_LM_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES)
MODEL_FOR_RETRIEVAL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_RETRIEVAL_MAPPING_NAMES)
MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING_NAMES
@@ -2126,6 +2142,13 @@ def from_pretrained(
AutoModelForImageTextToText = auto_class_update(AutoModelForImageTextToText, head_doc="image-text-to-text modeling")


class AutoModelForMultimodalLM(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_MULTIMODAL_LM_MAPPING


AutoModelForMultimodalLM = auto_class_update(AutoModelForMultimodalLM, head_doc="multimodal generation")


class AutoModelForAudioClassification(_BaseAutoModelClass):
_model_mapping = MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING

@@ -2276,6 +2299,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
"MODEL_FOR_VISION_2_SEQ_MAPPING",
"MODEL_FOR_RETRIEVAL_MAPPING",
"MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING",
"MODEL_FOR_MULTIMODAL_LM_MAPPING",
"MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING",
"MODEL_MAPPING",
"MODEL_WITH_LM_HEAD_MAPPING",
@@ -2303,6 +2327,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
"AutoModelForMaskedImageModeling",
"AutoModelForMaskedLM",
"AutoModelForMultipleChoice",
"AutoModelForMultimodalLM",
"AutoModelForNextSentencePrediction",
"AutoModelForObjectDetection",
"AutoModelForPreTraining",
2 changes: 1 addition & 1 deletion src/transformers/models/autoformer/modeling_autoformer.py
@@ -823,7 +823,7 @@ def forward(
class AutoformerPreTrainedModel(PreTrainedModel):
config: AutoformerConfig
base_model_prefix = "model"
input_modalities = "time"
input_modalities = ("time",)
main_input_name = "past_values"
supports_gradient_checkpointing = True

2 changes: 1 addition & 1 deletion src/transformers/models/aya_vision/modeling_aya_vision.py
@@ -91,7 +91,7 @@ def pixel_shuffle(self, image_features): # B, S, D
class AyaVisionPreTrainedModel(PreTrainedModel):
config: AyaVisionConfig
base_model_prefix = "model"
input_modalities = ["image", "text"]
input_modalities = ("image", "text")
supports_gradient_checkpointing = True
_skip_keys_device_placement = "past_key_values"
