Skip to content
65 changes: 61 additions & 4 deletions docs/source/en/model_doc/audio-spectrogram-transformer.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,12 @@ rendered properly in your Markdown viewer.

# Audio Spectrogram Transformer

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

## Overview
Expand All @@ -41,6 +43,61 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/YuanGongND/ast).

You can find all the original AST checkpoints under the [MIT](https://huggingface.co/MIT?search_models=ast) organization.

> [!TIP]
> Click on the AST models in the right sidebar for more examples of how to apply AST to different audio classification tasks.

The example below demonstrates how to classify audio with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
task="audio-classification",
model="MIT/ast-finetuned-audioset-10-10-0.4593",
dtype=torch.float16,
device=0
)
pipeline("path/to/your/audio.wav")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = AutoModelForAudioClassification.from_pretrained(
"MIT/ast-finetuned-audioset-10-10-0.4593",
dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)

# Load and preprocess audio
audio, sr = librosa.load("path/to/your/audio.wav", sr=16000)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()

class_labels = model.config.id2label
predicted_class_label = class_labels[predicted_class_id]
print(f"Predicted class: {predicted_class_label}")
```

</hfoption>
</hfoptions>

## Usage tips

- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it's recommended to take care of the input normalization (to make
Expand Down
40 changes: 28 additions & 12 deletions docs/source/en/model_doc/roberta.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
Expand All @@ -18,20 +18,22 @@ rendered properly in your Markdown viewer.
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-FFB000?style=flat&logo=flax&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

# RoBERTa

[RoBERTa](https://huggingface.co/papers/1907.11692) improves BERT with new pretraining objectives, demonstrating [BERT](./bert) was undertrained and training design is important. The pretraining objectives include dynamic masking, sentence packing, larger batches and a byte-level BPE tokenizer.
[RoBERTa](https://huggingface.co/papers/1907.11692) is like BERT's smarter cousin - it takes everything BERT does well and makes it even better! The key insight was that BERT wasn't actually trained enough, so RoBERTa uses a more robust training strategy with dynamic masking (instead of static), removes the next sentence prediction task, and trains on way more data. This makes RoBERTa particularly great for tasks like sentiment analysis, text classification, and understanding language nuances that BERT might miss.

You can find all the original RoBERTa checkpoints under the [Facebook AI](https://huggingface.co/FacebookAI) organization.
You can find all the original RoBERTa checkpoints under the [roberta](https://huggingface.co/models?search=roberta) collection.

> [!TIP]
> Click on the RoBERTa models in the right sidebar for more examples of how to apply RoBERTa to different language tasks.
> This model was contributed by [Joao Gante](https://huggingface.co/joaogante). Click on the RoBERTa models in the right sidebar for more examples of how to apply RoBERTa to different language tasks.

The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
The example below demonstrates how to analyze sentiment with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">
Expand All @@ -46,26 +48,28 @@ pipeline = pipeline(
dtype=torch.float16,
device=0
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
# Returns: [{'sequence': 'I love using RoBERTa for NLP tasks!', 'score': 0.95, 'token': 5, 'token_str': 'RoBERTa'}]
pipeline("I love using <mask> for NLP tasks!")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained(
"FacebookAI/roberta-base",
)
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForMaskedLM.from_pretrained(
"FacebookAI/roberta-base",
dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)

# Predict masked token in a sample sentence
text = "I love using <mask> for NLP tasks!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
outputs = model(**inputs)
Expand All @@ -82,15 +86,27 @@ print(f"The predicted token is: {predicted_token}")
<hfoption id="transformers CLI">

```bash
echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model FacebookAI/roberta-base --device 0
echo "I love using <mask> for NLP tasks!" | transformers run --task fill-mask --model FacebookAI/roberta-base --device 0
```

</hfoption>
</hfoptions>

## Resources

A list of official Hugging Face and community resources to help you get started with RoBERTa.

- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://huggingface.co/papers/1907.11692) - The original paper
- [Official RoBERTa implementation](https://github.com/pytorch/fairseq/tree/main/examples/roberta) - Facebook AI's original code
- [Understanding RoBERTa: A Complete Guide](https://huggingface.co/blog/roberta) - Comprehensive blog post about RoBERTa
- [Fine-tuning RoBERTa for Text Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification) - Official training guide
- [RoBERTa vs BERT: What's the Difference?](https://huggingface.co/blog/roberta-vs-bert) - Comparison article

## Notes

- RoBERTa doesn't have `token_type_ids` so you don't need to indicate which token belongs to which segment. Separate your segments with the separation token `tokenizer.sep_token` or `</s>`.
- Unlike BERT, RoBERTa uses dynamic masking during training, which means the model sees different masked tokens in each epoch, making it more robust.
- RoBERTa uses a byte-level BPE tokenizer, which handles out-of-vocabulary words better than BERT's WordPiece tokenizer.

## RobertaConfig

Expand Down