77 changes: 43 additions & 34 deletions docs/source/en/model_doc/pe_audio.md
@@ -1,4 +1,4 @@
<!--Copyright 2025 The HuggingFace Inc. team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -15,48 +15,48 @@ rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*

# PE Audio (Perception Encoder Audio)
# PE Audio

## Overview
[PE Audio](https://huggingface.co/papers/2504.13181) is the audio branch of Meta's Perception Encoder family. It is trained contrastively on paired audio–caption data to align raw waveforms and text in a shared embedding space, which supports cross-modal retrieval and zero-shot audio classification.

PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space.
The model enables cross-modal retrieval and understanding between audio and text.
Two heads are exposed on top of the same encoder. [`PeAudioModel`] returns one pooled embedding per clip for clip-level retrieval, while [`PeAudioFrameLevelModel`] returns one embedding every 40 ms for event localization and fine-grained temporal analysis.
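
To make the 40 ms frame rate concrete, the sketch below turns per-frame probabilities for one label into time spans. It is an illustration only: the helper names `frame_to_ms` and `active_segments`, the 0.5 threshold, and the sample probabilities are hypothetical and not part of the Transformers API.

```python
FRAME_MS = 40  # one frame-level embedding every 40 ms, as stated above


def frame_to_ms(frame_index: int) -> int:
    """Return the start time of a frame in milliseconds."""
    return frame_index * FRAME_MS


def active_segments(frame_probs, threshold=0.5):
    """Group consecutive frames whose probability exceeds `threshold`.

    `frame_probs` stands in for one (audio, text) row of frame-level
    probabilities; the result is a list of (start_ms, end_ms) tuples.
    """
    segments = []
    start = None
    for i, p in enumerate(frame_probs):
        if p > threshold and start is None:
            start = i  # a segment opens on the first frame above threshold
        elif p <= threshold and start is not None:
            segments.append((frame_to_ms(start), frame_to_ms(i)))
            start = None
    if start is not None:  # close a segment that runs to the end of the clip
        segments.append((frame_to_ms(start), frame_to_ms(len(frame_probs))))
    return segments


# Hypothetical probabilities for a single label over 8 frames (320 ms of audio).
probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.6, 0.9]
print(active_segments(probs))
```

Working in integer milliseconds avoids floating-point drift when frame indices are multiplied by the frame duration.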

**Text input**
You can find all the official PE Audio checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.

- Produces a single embedding representing the full text.
## Quickstart

**Audio input**

- **PeAudioFrameLevelModel**
- Produces a sequence of embeddings, one every 40 ms of audio.
- Suitable for audio event localization and fine-grained temporal analysis.
- **PeAudioModel**
- Produces a single embedding for the entire audio clip.
- Suitable for global audio-text retrieval tasks.
```py
import torch
from datasets import load_dataset
from transformers import AutoProcessor, PeAudioModel

**The resulting embeddings can be used for:**
processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
model = PeAudioModel.from_pretrained(
    "facebook/pe-av-large",
    device_map="auto",
)

- Audio event localization
- Cross-modal (audio–text) retrieval and matching
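ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# NOTE: LibriSpeech audio is stored at 16 kHz. Resample it to 48 kHz before real use,
# for example with `ds.cast_column("audio", datasets.Audio(sampling_rate=48_000))`.
audio = ds[0]["audio"]["array"]
labels = ["a dog barking", "a person speaking", "music playing"]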

## Usage
audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
inputs = {**audio_inputs, **text_inputs}

### Basic usage
with torch.no_grad():
    outputs = model(**inputs)

```py
TODO
probs = outputs.logits_audio_text.sigmoid()
print({label: p.item() for label, p in zip(labels, probs[0])})
```

## PeAudioFeatureExtractor

[[autodoc]] PeAudioFeatureExtractor
- __call__

## PeAudioProcessor
## Usage tips and notes

[[autodoc]] PeAudioProcessor
- __call__
- Audio must be mono (`feature_size=1`) and resampled to 48 kHz — the feature extractor warns but does not resample for you. Stereo input is not supported.
- Variable-length audio is handled with `padding_mask` (not the usual `attention_mask`). The mask is downsampled internally by `dac_config.hop_length` before it reaches the encoder, so pass the raw waveform-resolution mask that the feature extractor returns.
- [`PeAudioModel`] returns logits of shape `(n_audio, n_text)`. [`PeAudioFrameLevelModel`] returns `(n_audio, n_text, n_frames)` with one frame every 40 ms. Pick the class that matches the task — they share weights so swapping is cheap.
- The text tower is a shared encoder loaded via `AutoModel` from `config.text_config`. The tokenizer is attached to the processor via `AutoTokenizer`, not a dedicated class.
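
As a rough illustration of the `padding_mask` note above, the sketch below reduces a waveform-resolution mask to one value per frame. It rests on assumptions: `HOP_LENGTH` is derived here as 48 kHz × 40 ms = 1920 samples and may not match the checkpoint's actual `dac_config.hop_length`, and the strided selection rule is a stand-in for whatever the model does internally. The helper `downsample_padding_mask` is hypothetical.

```python
SAMPLE_RATE = 48_000
FRAME_MS = 40
# Assumption: hop length equals the number of samples in one 40 ms frame.
HOP_LENGTH = SAMPLE_RATE * FRAME_MS // 1000  # 1920 samples


def downsample_padding_mask(mask, hop_length=HOP_LENGTH):
    """Reduce a per-sample mask (1 = real audio, 0 = padding) to per-frame values.

    Keeps the mask value at the first sample of every hop window. This strided
    rule is an assumption for illustration, not the library's implementation.
    """
    return [mask[i] for i in range(0, len(mask), hop_length)]


# A 0.1 s clip (4800 samples) padded to 0.2 s (9600 samples).
sample_mask = [1] * 4800 + [0] * 4800
frame_mask = downsample_padding_mask(sample_mask)
print(len(frame_mask), frame_mask)
```

The point of the sketch is only that the mask you pass in lives at sample resolution, while the encoder sees far fewer positions.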

## PeAudioConfig

@@ -66,17 +66,26 @@ TODO

[[autodoc]] PeAudioEncoderConfig

## PeAudioFeatureExtractor

[[autodoc]] PeAudioFeatureExtractor
- __call__

## PeAudioProcessor

[[autodoc]] PeAudioProcessor

## PeAudioEncoder

[[autodoc]] PeAudioEncoder
- forward

## PeAudioFrameLevelModel
## PeAudioModel

[[autodoc]] PeAudioFrameLevelModel
[[autodoc]] PeAudioModel
- forward

## PeAudioModel
## PeAudioFrameLevelModel

[[autodoc]] PeAudioModel
[[autodoc]] PeAudioFrameLevelModel
- forward
58 changes: 45 additions & 13 deletions docs/source/en/model_doc/pe_audio_video.md
@@ -1,4 +1,4 @@
<!--Copyright 2025 The HuggingFace Inc. team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -15,24 +15,52 @@ rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*

# PE Audio Video (Perception Encoder Audio-Video)
# PE Audio Video

## Overview
[PE Audio Video](https://huggingface.co/papers/2504.13181) is the joint audio–video branch of Meta's Perception Encoder family. It encodes audio and video streams together with a shared text tower and, from a single forward pass, produces contrastive embeddings for each supported pairing: audio-text, video-text, audio-video, and audio+text-video.

TODO
Internally, the model aligns the video feature sequence to the audio's temporal resolution via nearest-neighbor interpolation, so clips whose video frame rate and audio sample rate differ stay in lockstep. The text encoder weights are tied across the audio and video branches.

## Usage
You can find all the official PE Audio Video checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.

### Basic usage
## Quickstart

```py
TODO
import torch
from datasets import load_dataset
from transformers import AutoProcessor, PeAudioVideoModel
from transformers.video_utils import load_video

processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
model = PeAudioVideoModel.from_pretrained(
    "facebook/pe-av-large",
    device_map="auto",
)

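ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# NOTE: LibriSpeech audio is stored at 16 kHz. Resample it to 48 kHz before real use,
# for example with `ds.cast_column("audio", datasets.Audio(sampling_rate=48_000))`.
audio = ds[0]["audio"]["array"]
video, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4")
labels = ["a person playing tennis with background crowd", "a dog barking in a park"]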

audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
video_inputs = processor.video_processor(video, num_frames=16, return_tensors="pt").to(model.device)
text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
inputs = {**audio_inputs, **video_inputs, **text_inputs}

with torch.no_grad():
    outputs = model(**inputs)

print("audio-text:", outputs.logits_audio_text.sigmoid().tolist())
print("video-text:", outputs.logits_video_text.sigmoid().tolist())
print("audio-video:", outputs.logits_audio_video.sigmoid().tolist())
```

## PeAudioVideoProcessor
## Usage tips and notes

[[autodoc]] PeAudioVideoProcessor
- __call__
- [`PeAudioVideoModel`] requires at least two of `input_ids`, `input_values`, `pixel_values_videos` — if only two are provided it dispatches to the audio-only or video-only sub-model. Passing all three triggers the joint audio-video-text path and the full set of logit matrices in [`PeAudioVideoOutput`].
- Audio uses `padding_mask` and video uses `padding_mask_videos` simultaneously. They are independent masks; do not conflate them with `attention_mask`, which is reserved for the text tower.
- Audio–video alignment runs per-batch-element inside `_align_video_hidden_state`, so batches with very different audio/video lengths iterate rather than vectorizing. Keep batch items roughly balanced for throughput.
- The text tower's weights are tied across branches via `_tied_weights_keys` — do not try to load separate text encoders for the audio and video halves.
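
The nearest-neighbor alignment of video features to the audio time axis can be pictured with the sketch below. It is a plain-Python illustration under stated assumptions: the helpers `nearest_neighbor_indices` and `align_video_to_audio` are hypothetical, the `floor(i * n_video / n_audio)` convention is one common choice, and the library's `_align_video_hidden_state` may round differently.

```python
def nearest_neighbor_indices(n_video: int, n_audio: int) -> list[int]:
    """For each audio step, pick the index of the video feature to copy.

    Uses floor(i * n_video / n_audio), clamped to the last video index.
    """
    if n_video <= 0 or n_audio <= 0:
        raise ValueError("both sequence lengths must be positive")
    return [min(i * n_video // n_audio, n_video - 1) for i in range(n_audio)]


def align_video_to_audio(video_feats, n_audio):
    """Repeat or drop video features so the sequence has `n_audio` steps."""
    idx = nearest_neighbor_indices(len(video_feats), n_audio)
    return [video_feats[j] for j in idx]


# Four video features stretched to ten audio steps.
video_feats = ["v0", "v1", "v2", "v3"]
aligned = align_video_to_audio(video_feats, 10)
print(aligned)
```

Because each batch element can have its own pair of lengths, the index map differs per element, which is consistent with the per-element loop described in the tips.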

## PeAudioVideoConfig

@@ -42,12 +70,16 @@ TODO

[[autodoc]] PeAudioVideoEncoderConfig

## PeAudioVideoModel
## PeAudioVideoProcessor

[[autodoc]] PeAudioVideoModel
- forward
[[autodoc]] PeAudioVideoProcessor

## PeAudioVideoEncoder

[[autodoc]] PeAudioVideoEncoder
- forward

## PeAudioVideoModel

[[autodoc]] PeAudioVideoModel
- forward
61 changes: 43 additions & 18 deletions docs/source/en/model_doc/pe_video.md
@@ -1,4 +1,4 @@
<!--Copyright 2025 The HuggingFace Inc. team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -15,44 +15,69 @@ rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*

# PE Video (Perception Encoder Video)
# PE Video

## Overview
[PE Video](https://huggingface.co/papers/2504.13181) is the video branch of Meta's Perception Encoder family. It contrastively aligns video clips with text into a shared embedding space, enabling zero-shot video classification and video–text retrieval from a single pretrained backbone.

TODO
The encoder's rotary embeddings and patch embedder treat the temporal axis as a first-class dimension, so variable-length clips can be encoded without tiling each frame independently.

## Usage
You can find all the official PE Video checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.

### Basic usage
## Quickstart

```py
TODO
import torch
from transformers import AutoProcessor, PeVideoModel
from transformers.video_utils import load_video

processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
model = PeVideoModel.from_pretrained(
    "facebook/pe-av-large",
    device_map="auto",
)

video, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4")
labels = ["a person playing tennis", "a person cooking", "a cat sleeping"]

video_inputs = processor.video_processor(video, num_frames=16, return_tensors="pt").to(model.device)
text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
inputs = {**video_inputs, **text_inputs}

with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_video_text.sigmoid()
print({label: p.item() for label, p in zip(labels, probs[0])})
```

## PeVideoVideoProcessor
## Usage tips and notes

[[autodoc]] PeVideoVideoProcessor
- __call__
- Variable-length videos use `padding_mask_videos` (not `attention_mask`). The video processor only pads and returns this mask when `return_tensors` is set — without it you get a list of per-clip tensors and no mask.
- Pass `num_frames` to the video processor for fixed-length uniform sampling across `[0, total_frames-1]`. Omit it to fall back to fps-based sampling from the base class. Checkpoints are usually trained at a specific frame count, so match what the checkpoint expects.
- Encoder input is `pixel_values_videos`. The encoder's `main_input_name` is `"pixel_values_videos"` while the full model's is `"input_ids"`, which matters when routing through generic utilities that inspect `main_input_name`.
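
Uniform sampling across `[0, total_frames-1]` can be sketched as follows. This is a minimal illustration, assuming evenly spaced positions rounded to the nearest frame; the helper `uniform_frame_indices` is hypothetical, and the video processor's own rounding may differ.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices evenly spaced over [0, total_frames - 1]."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    # Rounding to the nearest frame is an assumption made for this sketch.
    return [round(i * step) for i in range(num_frames)]


print(uniform_frame_indices(100, 4))
print(uniform_frame_indices(250, 16))
```

When a clip has fewer frames than `num_frames`, this scheme repeats indices instead of failing, which is one reason to check the frame count a checkpoint expects.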

## PeVideoProcessor
## PeVideoConfig

[[autodoc]] PeVideoProcessor
- __call__
[[autodoc]] PeVideoConfig

## PeVideoEncoderConfig

[[autodoc]] PeVideoEncoderConfig

## PeVideoConfig
## PeVideoVideoProcessor

[[autodoc]] PeVideoConfig
[[autodoc]] PeVideoVideoProcessor

## PeVideoModel
## PeVideoProcessor

[[autodoc]] PeVideoModel
- forward
[[autodoc]] PeVideoProcessor

## PeVideoEncoder

[[autodoc]] PeVideoEncoder
- forward

## PeVideoModel

[[autodoc]] PeVideoModel
- forward