Fix/pe audio video bugs by massimilianoviola · Pull Request #45886 · huggingface/transformers

massimilianoviola · 2026-05-11T07:00:47Z

What does this PR do?

1. Migrate PE-AV processor to the v5 sub-processor API

PeAudioVideoProcessor still uses the legacy `feature_extractor_class` and `video_processor_class` attributes that #41633 deprecated, so every checkpoint load prints two deprecation warnings. The Auto mappings are already registered, so we can drop the legacy attributes and add an explicit `__init__`.
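A toy sketch of why the explicit `__init__` matters. It assumes (as the change implies) that a v5-style base class discovers sub-processors from the `__init__` parameter names and warns when legacy `*_class` attributes are still defined. `ToyProcessorBase`, `LegacyProcessor`, `NewProcessor` and `count_warnings` are hypothetical stand-ins, not the transformers `ProcessorMixin` implementation.

```python
import inspect
import warnings


class ToyProcessorBase:
    """Hypothetical stand-in for a v5-style processor base class (not ProcessorMixin)."""

    LEGACY_ATTRS = ("feature_extractor_class", "video_processor_class")

    def __init__(self, *args, **kwargs):
        # Emit one deprecation warning per legacy attribute defined on the subclass.
        for attr in self.LEGACY_ATTRS:
            if attr in vars(type(self)):
                warnings.warn(f"{attr} is deprecated", DeprecationWarning)
        # Bind positional sub-processors to the names found in the subclass __init__.
        for name, value in zip(self.sub_processor_names(), args):
            setattr(self, name, value)

    @classmethod
    def sub_processor_names(cls):
        # Named positional-or-keyword parameters, skipping self, *args and **kwargs.
        params = inspect.signature(cls.__init__).parameters.values()
        return [p.name for p in params if p.name != "self" and p.kind is p.POSITIONAL_OR_KEYWORD]


class LegacyProcessor(ToyProcessorBase):
    # Old style: sub-processors declared through class attributes.
    feature_extractor_class = "AutoFeatureExtractor"
    video_processor_class = "AutoVideoProcessor"


class NewProcessor(ToyProcessorBase):
    # New style: sub-processors declared by the __init__ signature.
    def __init__(self, feature_extractor=None, video_processor=None, tokenizer=None, **kwargs):
        super().__init__(feature_extractor, video_processor, tokenizer, **kwargs)


def count_warnings(cls, *args):
    # Instantiate cls and return how many warnings were raised.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        cls(*args)
    return len(caught)


print(count_warnings(LegacyProcessor, "fe", "vp", "tok"))  # 2
print(count_warnings(NewProcessor, "fe", "vp", "tok"))     # 0
print(NewProcessor.sub_processor_names())  # ['feature_extractor', 'video_processor', 'tokenizer']
```

The design point the toy illustrates: once the signature carries the sub-processor names, the class attributes become redundant, and removing them is what silences the two warnings.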

2. Fix `get_*_embeds` crash

The `get_text_*_embeds` helpers call the text model without `output_hidden_states=True` and then access `text_outputs.hidden_states[-1]`. Without that flag `hidden_states` is `None`, so the helpers crash with a `TypeError`.
The fix is one extra kwarg per helper, mirroring the `forward` behaviour.
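The failure mode can be reproduced with a pure-Python stand-in. `ToyTextOutput`, `toy_text_model` and the two helper functions are hypothetical; they only mimic the convention that `hidden_states` stays `None` unless `output_hidden_states=True` is passed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyTextOutput:
    last_hidden_state: List[int]
    hidden_states: Optional[Tuple[List[int], ...]] = None


def toy_text_model(input_ids, output_hidden_states=False):
    # Hypothetical 2-layer "model": each layer adds 1 to every value.
    states = [list(input_ids)]
    for _ in range(2):
        states.append([x + 1 for x in states[-1]])
    # As in the real convention, hidden_states is only populated on request.
    return ToyTextOutput(
        last_hidden_state=states[-1],
        hidden_states=tuple(states) if output_hidden_states else None,
    )


def get_text_embeds_before(input_ids):
    outputs = toy_text_model(input_ids)
    return outputs.hidden_states[-1]  # None[-1] raises TypeError


def get_text_embeds_after(input_ids):
    outputs = toy_text_model(input_ids, output_hidden_states=True)
    return outputs.hidden_states[-1]


try:
    get_text_embeds_before([1, 2])
except TypeError as err:
    print("before:", err)
print("after:", get_text_embeds_after([1, 2]))  # after: [3, 4]
```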

3. Friendlier `forward` error for single-modality inputs

`forward` requires at least two modalities; single modalities are handled by the `get_*_embeds` helpers.
I extended the `ValueError` message to mention those helpers, so users who read "you can omit any of the modalities, and use the same forward method" in the model card aren't left stuck.
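A minimal sketch of the kind of check described above, assuming a `forward` that counts the non-`None` modalities before doing any work. `toy_forward`, its argument names and the message text are hypothetical, not the actual PE-AV code.

```python
def toy_forward(text=None, audio=None, video=None):
    # Collect the names of the modalities that were actually passed.
    provided = [
        name
        for name, value in (("text", text), ("audio", audio), ("video", video))
        if value is not None
    ]
    if len(provided) < 2:
        raise ValueError(
            f"forward needs at least two modalities, got {provided}. "
            "To embed a single modality, use the get_*_embeds helpers instead."
        )
    # The real model would compute cross-modal outputs here; the toy just echoes.
    return provided


print(toy_forward(text="a cat", audio=[0.1, 0.2]))  # ['text', 'audio']
try:
    toy_forward(text="a cat")
except ValueError as err:
    print(err)
```

Pointing to the helpers inside the error keeps the validation strict while giving single-modality users a concrete next step.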

4. Fill in `docs/source/en/model_doc/pe_audio_video.md`

Replaced the page content with a short overview, the clean architecture figure (upload pending at https://huggingface.co/datasets/huggingface/documentation-images/discussions/614), and a link to the well-documented PE-AV collection for checkpoints and end-to-end usage.
I noticed #45612 already proposes a doc fill-in for this page. Happy to drop or adapt this commit if that PR is preferred; the other three fixes are independent of the doc change.

Testing

Tested on facebook/pe-av-small with the minimal code below.

import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("facebook/pe-av-small").eval()
processor = AutoProcessor.from_pretrained("facebook/pe-av-small")
# #1: AutoProcessor.from_pretrained above no longer emits deprecation warnings.
text_inputs = processor(text=["a photo of a cat", "a person speaking"], return_tensors="pt", padding=True)

# #2: previously crashed with TypeError, now returns torch.Size([2, 1024])
with torch.no_grad():
    out = model.get_text_audio_video_embeds(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs.get("attention_mask"),
    )
print(out.shape)

# #3: the single-modality error now points to the get_*_embeds helpers
try:
    model(**text_inputs)
except ValueError as e:
    print(e)

Code Agent Policy

[x] I confirm that this is not a pure code agent PR.

Who can review?

First time, so don't hate me if I tag the wrong people @zucchini-nlp @stevhliu XD

zucchini-nlp · 2026-05-11T10:14:44Z

+    def __init__(self, feature_extractor=None, video_processor=None, tokenizer=None, **kwargs):
+        super().__init__(feature_extractor, video_processor, tokenizer, **kwargs)


zucchini-nlp

Thanks, lgtm!

zucchini-nlp · 2026-05-11T10:15:25Z

run-slow: pe_audio_video

github-actions · 2026-05-11T10:16:58Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/pe_audio_video"]
quantizations: []

HuggingFaceDocBuilderDev · 2026-05-11T10:28:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2026-05-11T10:29:32Z

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | 102f0820 | workflow commit (merge commit) |
| PR | 9c817809 | branch commit (from PR) |
| main | 6c66de3f | base commit (on `main`) |

✅ No failing test specific to this PR 🎉 👏 !

stevhliu

nice, the docs will indeed be handled in the other PR! good to merge once that's dropped here :)

massimilianoviola · 2026-05-11T16:24:50Z

> nice, the docs will indeed be handled in the other PR! good to merge once that's dropped here :)

ok, I'll revert the doc change then! thanks


github-actions · 2026-05-11T16:28:22Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: pe_audio_video

* Register correct mapping
* set output_hidden_states in get_text_*_embeds helpers
* point to get_*_embeds in forward error message
* Populate PE AV documentation
* Revert "Populate PE AV documentation"

This reverts commit 9c81780.

massimilianoviola added 4 commits May 10, 2026 22:04

Register correct mapping

0bc3833

set output_hidden_states in get_text_*_embeds helpers

65fd8d4

point to get_*_embeds in forward error message

6cbfa4e

Populate PE AV documentation

9c81780

zucchini-nlp reviewed May 11, 2026

zucchini-nlp approved these changes May 11, 2026

stevhliu reviewed May 11, 2026

Revert "Populate PE AV documentation"

3f7312c

This reverts commit 9c81780.

Merge branch 'main' into fix/pe-audio-video-bugs

f98e891

zucchini-nlp added this pull request to the merge queue May 12, 2026

Merged via the queue into huggingface:main with commit a4c91a1 May 12, 2026
22 checks passed

massimilianoviola deleted the fix/pe-audio-video-bugs branch May 12, 2026 08:39


zucchini-nlp merged 6 commits into huggingface:main from massimilianoviola:fix/pe-audio-video-bugs


4 participants
