Fix/pe audio video bugs #45886

Merged
zucchini-nlp merged 6 commits into huggingface:main from massimilianoviola:fix/pe-audio-video-bugs May 12, 2026

Conversation

@massimilianoviola commented May 11, 2026

Contributor

What does this PR do?

1. Migrate PE-AV processor to the v5 sub-processor API

PeAudioVideoProcessor still uses the legacy feature_extractor_class and video_processor_class that #41633 deprecated, so every checkpoint load prints two deprecation warnings. The Auto mappings are already registered, so we can drop the legacy attrs and add an explicit __init__.

2. Fix get_*_embeds crash

The get_text_*_embeds helpers call the text model without output_hidden_states=True and then access text_outputs.hidden_states[-1], which is None, and so they crash with TypeError.
The fix is one extra kwarg per helper, mirroring the forward behaviour.
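
To make the failure mode concrete, here is a minimal sketch that uses a toy stand-in rather than the real PE-AV text model. `ToyTextOutput`, `toy_text_model`, and both helper functions are hypothetical names invented for illustration; the only behaviour assumed is the one described above, namely that `hidden_states` stays `None` unless `output_hidden_states=True` is passed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyTextOutput:
    # Hypothetical stand-in for the text model's output object.
    last_hidden_state: str
    hidden_states: Optional[Tuple[str, ...]] = None


def toy_text_model(input_ids, output_hidden_states=False):
    # Only populate hidden_states when the caller asks for them.
    layers = tuple(f"layer{i}({input_ids})" for i in range(3))
    return ToyTextOutput(
        last_hidden_state=layers[-1],
        hidden_states=layers if output_hidden_states else None,
    )


def get_embeds_before_fix(input_ids):
    text_outputs = toy_text_model(input_ids)
    # hidden_states is None here, so indexing raises TypeError.
    return text_outputs.hidden_states[-1]


def get_embeds_after_fix(input_ids):
    # The one extra kwarg makes hidden_states available.
    text_outputs = toy_text_model(input_ids, output_hidden_states=True)
    return text_outputs.hidden_states[-1]


try:
    get_embeds_before_fix("ids")
    crashed = False
except TypeError:
    crashed = True

fixed = get_embeds_after_fix("ids")
```

In this toy version `crashed` ends up `True` and `fixed` holds the last layer's value, which mirrors the before/after behaviour of the helpers.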

3. Friendlier forward error for single-modality inputs

forward requires ≥2 modalities; single modalities are handled by the get_*_embeds helpers.
Extended the ValueError to mention the existence of those helpers, so users reading the "you can omit any of the modalities, and use the same forward method" in the model card aren't stuck.
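
As a rough illustration of the check, the sketch below counts the supplied modalities and raises a `ValueError` that points to the helpers. It is not the code from this PR: `toy_forward`, the `input_values` and `pixel_values_videos` argument names, and the message wording are all assumptions made for the example.

```python
def toy_forward(input_ids=None, input_values=None, pixel_values_videos=None):
    # Hypothetical argument names: only input_ids appears in the PR text.
    provided = [
        name
        for name, value in (
            ("text", input_ids),
            ("audio", input_values),
            ("video", pixel_values_videos),
        )
        if value is not None
    ]
    if len(provided) < 2:
        # Illustrative wording, not the message shipped in the PR.
        raise ValueError(
            f"forward needs at least two modalities, got {provided}. "
            "For a single modality, use the get_*_embeds helpers instead."
        )
    return provided


try:
    toy_forward(input_ids=[1, 2, 3])
    message = None
except ValueError as err:
    message = str(err)

pair = toy_forward(input_ids=[1, 2, 3], input_values=[0.1, 0.2])
```

With text alone the call fails and `message` names the helpers; with text plus audio it succeeds.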

4. Fill in docs/source/en/model_doc/pe_audio_video.md

Replaced with a short overview, the clean architecture figure (upload pending here https://huggingface.co/datasets/huggingface/documentation-images/discussions/614), and a link to the well-documented PE-AV collection for checkpoints and end-to-end usage.
I noticed #45612 already proposes a doc fill-in for this page. Happy to drop/adapt this commit if the existing PR is preferred, but the other three fixes are independent of the doc change.

Testing

Run facebook/pe-av-small with the minimal code below.

import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("facebook/pe-av-small").eval()
processor = AutoProcessor.from_pretrained("facebook/pe-av-small")
# #1: AutoProcessor.from_pretrained above no longer emits deprecation warnings.
text_inputs = processor(text=["a photo of a cat", "a person speaking"], return_tensors="pt", padding=True)

# #2: previously crashed with TypeError, now returns torch.Size([2, 1024])
with torch.no_grad():
    out = model.get_text_audio_video_embeds(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs.get("attention_mask"),
    )
print(out.shape)

# #3: error now points to get_*_embeds helpers
try:
    model(**text_inputs)
except ValueError as e:
    print(e)

Code Agent Policy

  • [x] I confirm that this is not a pure code agent PR.

Who can review?

First time, so don't hate me if I tag the wrong people @zucchini-nlp @stevhliu XD

Comment on lines +18 to +19
def __init__(self, feature_extractor=None, video_processor=None, tokenizer=None, **kwargs):
    super().__init__(feature_extractor, video_processor, tokenizer, **kwargs)

Member

niiice!

@zucchini-nlp zucchini-nlp left a comment

Member

Thanks, lgtm!

@zucchini-nlp

Member

run-slow: pe_audio_video

@github-actions

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/pe_audio_video"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions

Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      102f0820  workflow commit (merge commit)
PR       9c817809  branch commit (from PR)
main     6c66de3f  base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@stevhliu stevhliu left a comment

Member

nice, the docs will indeed be handled in the other PR! good to merge once that's dropped here :)

@massimilianoviola

Contributor Author

> nice, the docs will indeed be handled in the other PR! good to merge once that's dropped here :)

ok, I'll revert the doc change then! thanks

@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: pe_audio_video

@zucchini-nlp added this pull request to the merge queue May 12, 2026
Merged via the queue into huggingface:main with commit a4c91a1 May 12, 2026
22 checks passed
@massimilianoviola deleted the fix/pe-audio-video-bugs branch May 12, 2026 08:39
jp1924 pushed a commit to jp1924/transformers that referenced this pull request May 18, 2026
* Register correct mapping

* set output_hidden_states in get_text_*_embeds helpers

* point to get_*_embeds in forward error message

* Populate PE AV documentation

* Revert "Populate PE AV documentation"

This reverts commit 9c81780.
khushali9 pushed a commit to khushali9/transformers that referenced this pull request Jun 8, 2026
* Register correct mapping

* set output_hidden_states in get_text_*_embeds helpers

* point to get_*_embeds in forward error message

* Populate PE AV documentation

* Revert "Populate PE AV documentation"

This reverts commit 9c81780.