huggingface · stevhliu · May 12, 2026 · Apr 23, 2026 · Apr 23, 2026 · Apr 30, 2026
diff --git a/docs/source/en/model_doc/pe_audio.md b/docs/source/en/model_doc/pe_audio.md
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -15,48 +15,48 @@ rendered properly in your Markdown viewer.
 -->
 *This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*
 
-# PE Audio (Perception Encoder Audio)
+# PE Audio
 
-## Overview
+[PE Audio](https://huggingface.co/papers/2504.13181) is the audio branch of Meta's Perception Encoder family. It is trained on paired audio–caption data to contrastively align raw waveforms with text in a shared embedding space, which supports cross-modal retrieval and zero-shot audio classification.
 
-PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space.
-The model enables cross-modal retrieval and understanding between audio and text.
+Two heads are exposed on top of the same encoder. [`PeAudioModel`] returns one pooled embedding per clip for clip-level retrieval, while [`PeAudioFrameLevelModel`] returns one embedding every 40 ms for event localization and fine-grained temporal analysis.
 
-**Text input**
+You can find all the official PE Audio checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.
 
-- Produces a single embedding representing the full text.
+## Quickstart
 
-**Audio input**
-
-- **PeAudioFrameLevelModel**
-  - Produces a sequence of embeddings, one every 40 ms of audio.
-  - Suitable for audio event localization and fine-grained temporal analysis.
-- **PeAudioModel**
-  - Produces a single embedding for the entire audio clip.
-  - Suitable for global audio-text retrieval tasks.
+```py
+import torch
+from datasets import Audio, load_dataset
+from transformers import AutoProcessor, PeAudioModel
 
-**The resulting embeddings can be used for:**
+processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
+model = PeAudioModel.from_pretrained(
+    "facebook/pe-av-large",
+    device_map="auto",
+)
 
-- Audio event localization
-- Cross-modal (audio–text) retrieval and matching
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+audio = ds.cast_column("audio", Audio(sampling_rate=48_000))[0]["audio"]["array"]  # LibriSpeech is 16 kHz; resample to 48 kHz
+labels = ["a dog barking", "a person speaking", "music playing"]
 
-## Usage
+audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
+text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
+inputs = {**audio_inputs, **text_inputs}
 
-### Basic usage
+with torch.no_grad():
+    outputs = model(**inputs)
 
-```py
-TODO
+probs = outputs.logits_audio_text.sigmoid()
+print({label: p.item() for label, p in zip(labels, probs[0])})
 ```
 
-## PeAudioFeatureExtractor
-
-[[autodoc]] PeAudioFeatureExtractor
-    - __call__
-
-## PeAudioProcessor
+## Usage tips and notes
 
-[[autodoc]] PeAudioProcessor
-    - __call__
+- Audio must be mono (`feature_size=1`) and resampled to 48 kHz — the feature extractor warns but does not resample for you. Stereo input is not supported.
+- Variable-length audio is handled with `padding_mask` (not the usual `attention_mask`). The mask is downsampled internally by `dac_config.hop_length` before it reaches the encoder, so pass the raw waveform-resolution mask that the feature extractor returns.
+- [`PeAudioModel`] returns logits of shape `(n_audio, n_text)`. [`PeAudioFrameLevelModel`] returns `(n_audio, n_text, n_frames)` with one frame every 40 ms. Pick the class that matches the task — they share weights so swapping is cheap.
+- The text tower is a shared encoder loaded via `AutoModel` from `config.text_config`. The tokenizer is attached to the processor via `AutoTokenizer`, not a dedicated class.
 
 ## PeAudioConfig
 
@@ -66,17 +66,26 @@ TODO
 
 [[autodoc]] PeAudioEncoderConfig
 
+## PeAudioFeatureExtractor
+
+[[autodoc]] PeAudioFeatureExtractor
+    - __call__
+
+## PeAudioProcessor
+
+[[autodoc]] PeAudioProcessor
+
 ## PeAudioEncoder
 
 [[autodoc]] PeAudioEncoder
     - forward
 
-## PeAudioFrameLevelModel
+## PeAudioModel
 
-[[autodoc]] PeAudioFrameLevelModel
+[[autodoc]] PeAudioModel
     - forward
 
-## PeAudioModel
+## PeAudioFrameLevelModel
 
-[[autodoc]] PeAudioModel
+[[autodoc]] PeAudioFrameLevelModel
     - forward
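
The `pe_audio.md` notes above say that [`PeAudioFrameLevelModel`] returns one embedding every 40 ms, which is what makes event localization possible. As a rough sketch of that last step, the snippet below turns per-frame scores for a single label into time spans. The helper name `frames_to_segments`, the 0.5 threshold, and the example scores are illustrative assumptions and not part of the Transformers API.

```python
FRAME_MS = 40  # one frame-level embedding every 40 ms, per the usage notes


def frames_to_segments(frame_probs, threshold=0.5):
    """Return (start_ms, end_ms) spans of consecutive frames scoring >= threshold."""
    segments = []
    start = None  # index of the first frame in the currently open span
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i  # open a new span
        elif p < threshold and start is not None:
            segments.append((start * FRAME_MS, i * FRAME_MS))  # close the span
            start = None
    if start is not None:
        # The clip ended while a span was still open.
        segments.append((start * FRAME_MS, len(frame_probs) * FRAME_MS))
    return segments


# Hypothetical per-frame probabilities for one (audio, label) pair.
probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]
segments = frames_to_segments(probs)
print(segments)  # [(80, 200), (240, 320)]
```

In practice the scores would come from the last axis of the `(n_audio, n_text, n_frames)` logits after a sigmoid; a fixed threshold is only one of several reasonable ways to segment them.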
diff --git a/docs/source/en/model_doc/pe_audio_video.md b/docs/source/en/model_doc/pe_audio_video.md
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -15,24 +15,52 @@ rendered properly in your Markdown viewer.
 -->
 *This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*
 
-# PE Audio Video (Perception Encoder Audio-Video)
+# PE Audio Video
 
-## Overview
+[PE Audio Video](https://huggingface.co/papers/2504.13181) is the joint audio–video branch of Meta's Perception Encoder family. It encodes audio and video streams together with a shared text tower, producing contrastive embeddings for every pairwise combination (audio-text, video-text, audio-video, and audio+text-video) from a single forward pass.
 
-TODO
+Internally, the model aligns the video feature sequence to the audio's temporal resolution via nearest-neighbor interpolation, so clips whose video frame rate and audio sample rate differ stay in lockstep. The text encoder weights are tied across the audio and video branches.
 
-## Usage
+You can find all the official PE Audio Video checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.
 
-### Basic usage
+## Quickstart
 
 ```py
-TODO
+import torch
+from datasets import Audio, load_dataset
+from transformers import AutoProcessor, PeAudioVideoModel
+from transformers.video_utils import load_video
+
+processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
+model = PeAudioVideoModel.from_pretrained(
+    "facebook/pe-av-large",
+    device_map="auto",
+)
+
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+audio = ds.cast_column("audio", Audio(sampling_rate=48_000))[0]["audio"]["array"]  # LibriSpeech is 16 kHz; resample to 48 kHz
+video, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4")
+labels = ["a person playing tennis with background crowd", "a dog barking in a park"]
+
+audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
+video_inputs = processor.video_processor(video, num_frames=16, return_tensors="pt").to(model.device)
+text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
+inputs = {**audio_inputs, **video_inputs, **text_inputs}
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+print("audio-text:", outputs.logits_audio_text.sigmoid().tolist())
+print("video-text:", outputs.logits_video_text.sigmoid().tolist())
+print("audio-video:", outputs.logits_audio_video.sigmoid().tolist())
 ```
 
-## PeAudioVideoProcessor
+## Usage tips and notes
 
-[[autodoc]] PeAudioVideoProcessor
-    - __call__
+- [`PeAudioVideoModel`] requires at least two of `input_ids`, `input_values`, and `pixel_values_videos`. If only two are provided, it dispatches to the audio-only or video-only sub-model. Passing all three triggers the joint audio-video-text path and returns the full set of logit matrices in [`PeAudioVideoOutput`].
+- Audio uses `padding_mask` and video uses `padding_mask_videos` simultaneously. They are independent masks; do not conflate them with `attention_mask`, which is reserved for the text tower.
+- Audio–video alignment runs per-batch-element inside `_align_video_hidden_state`, so batches with very different audio/video lengths iterate rather than vectorizing. Keep batch items roughly balanced for throughput.
+- The text tower's weights are tied across branches via `_tied_weights_keys` — do not try to load separate text encoders for the audio and video halves.
 
 ## PeAudioVideoConfig
 
@@ -42,12 +70,16 @@ TODO
 
 [[autodoc]] PeAudioVideoEncoderConfig
 
-## PeAudioVideoModel
+## PeAudioVideoProcessor
 
-[[autodoc]] PeAudioVideoModel
-    - forward
+[[autodoc]] PeAudioVideoProcessor
 
 ## PeAudioVideoEncoder
 
 [[autodoc]] PeAudioVideoEncoder
     - forward
+
+## PeAudioVideoModel
+
+[[autodoc]] PeAudioVideoModel
+    - forward
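
The `pe_audio_video.md` overview above says video features are aligned to the audio's temporal resolution via nearest-neighbor interpolation. The sketch below shows the idea on plain Python lists, assuming the common `floor(i * src_len / dst_len)` index rule. The function names are hypothetical, and the actual `_align_video_hidden_state` implementation may differ in detail.

```python
def nearest_indices(src_len, dst_len):
    """For each target step i, pick source index floor(i * src_len / dst_len)."""
    return [min(i * src_len // dst_len, src_len - 1) for i in range(dst_len)]


def align_to_audio(video_feats, n_audio_steps):
    """Repeat or drop video steps so the sequence has n_audio_steps entries."""
    return [video_feats[j] for j in nearest_indices(len(video_feats), n_audio_steps)]


# Four video steps stretched to ten audio steps (placeholder strings stand in for vectors).
video_feats = ["v0", "v1", "v2", "v3"]
aligned = align_to_audio(video_feats, 10)
print(aligned)
```

Because each clip can have its own pair of lengths, an index mapping like this is computed per batch element, which is consistent with the throughput note about keeping batch items roughly balanced.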
diff --git a/docs/source/en/model_doc/pe_video.md b/docs/source/en/model_doc/pe_video.md
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -15,44 +15,69 @@ rendered properly in your Markdown viewer.
 -->
 *This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*
 
-# PE Video (Perception Encoder Video)
+# PE Video
 
-## Overview
+[PE Video](https://huggingface.co/papers/2504.13181) is the video branch of Meta's Perception Encoder family. It contrastively aligns video clips with text into a shared embedding space, enabling zero-shot video classification and video–text retrieval from a single pretrained backbone.
 
-TODO
+The encoder's rotary embeddings and patch embedder treat the temporal axis as a first-class dimension, so variable-length clips can be encoded without tiling each frame independently.
 
-## Usage
+You can find all the official PE Video checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.
 
-### Basic usage
+## Quickstart
 
 ```py
-TODO
+import torch
+from transformers import AutoProcessor, PeVideoModel
+from transformers.video_utils import load_video
+
+processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
+model = PeVideoModel.from_pretrained(
+    "facebook/pe-av-large",
+    device_map="auto",
+)
+
+video, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4")
+labels = ["a person playing tennis", "a person cooking", "a cat sleeping"]
+
+video_inputs = processor.video_processor(video, num_frames=16, return_tensors="pt").to(model.device)
+text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
+inputs = {**video_inputs, **text_inputs}
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+probs = outputs.logits_video_text.sigmoid()
+print({label: p.item() for label, p in zip(labels, probs[0])})
 ```
 
-## PeVideoVideoProcessor
+## Usage tips and notes
 
-[[autodoc]] PeVideoVideoProcessor
-    - __call__
+- Variable-length videos use `padding_mask_videos` (not `attention_mask`). The video processor only pads and returns this mask when `return_tensors` is set — without it you get a list of per-clip tensors and no mask.
+- Pass `num_frames` to the video processor for fixed-length uniform sampling across `[0, total_frames-1]`. Omit it to fall back to fps-based sampling from the base class. Checkpoints are usually trained at a specific frame count, so match what the checkpoint expects.
+- Encoder input is `pixel_values_videos`. The encoder's `main_input_name` is `"pixel_values_videos"` while the full model's is `"input_ids"`, which matters when routing through generic utilities that inspect `main_input_name`.
 
-## PeVideoProcessor
+## PeVideoConfig
 
-[[autodoc]] PeVideoProcessor
-    - __call__
+[[autodoc]] PeVideoConfig
 
 ## PeVideoEncoderConfig
 
 [[autodoc]] PeVideoEncoderConfig
 
-## PeVideoConfig
+## PeVideoVideoProcessor
 
-[[autodoc]] PeVideoConfig
+[[autodoc]] PeVideoVideoProcessor
 
-## PeVideoModel
+## PeVideoProcessor
 
-[[autodoc]] PeVideoModel
-    - forward
+[[autodoc]] PeVideoProcessor
 
 ## PeVideoEncoder
 
 [[autodoc]] PeVideoEncoder
     - forward
+
+## PeVideoModel
+
+[[autodoc]] PeVideoModel
+    - forward
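
The `pe_video.md` notes above describe `num_frames` as fixed-length uniform sampling across `[0, total_frames-1]`. A minimal sketch of that index selection is shown below. The helper name `uniform_frame_indices` and the rounding convention are assumptions, so the real video processor may pick slightly different frames.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick num_frames indices evenly spaced over [0, total_frames - 1]."""
    if num_frames < 1 or total_frames < 1:
        raise ValueError("total_frames and num_frames must both be positive")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)  # spacing between sampled frames
    return [round(i * step) for i in range(num_frames)]


print(uniform_frame_indices(10, 4))  # [0, 3, 6, 9]
indices = uniform_frame_indices(100, 16)
print(indices[0], indices[-1], len(indices))
```

The first and last frames are always included, and asking for more frames than the clip contains produces repeated indices rather than an error.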