Merged
31 commits
2898de3
📝 start adding inference section to task guides
stevhliu Aug 26, 2022
de69778
✨ make style
stevhliu Aug 26, 2022
18c10ae
📝 add multiple choice
stevhliu Aug 31, 2022
41c4d7a
add rest of inference sections
stevhliu Sep 9, 2022
165834f
make style
stevhliu Sep 9, 2022
7e4fe03
add compute_metric, push_to_hub, pipeline
stevhliu Sep 29, 2022
c1aae79
make style
stevhliu Sep 29, 2022
bb1d74e
add updated sequence and token classification
stevhliu Sep 30, 2022
fdf0a06
make style
stevhliu Sep 30, 2022
8d1429e
make edits in token classification
stevhliu Oct 6, 2022
7262397
add audio classification
stevhliu Oct 7, 2022
ae18cdb
make style
stevhliu Oct 7, 2022
0aff67d
add asr
stevhliu Oct 7, 2022
389f239
make style
stevhliu Oct 7, 2022
b376bc4
add image classification
stevhliu Oct 11, 2022
375d26e
make style
stevhliu Oct 11, 2022
2791b83
add summarization
stevhliu Oct 11, 2022
b124440
make style
stevhliu Oct 11, 2022
d1ae774
add translation
stevhliu Oct 11, 2022
d78d9cd
make style
stevhliu Oct 11, 2022
42d8566
add multiple choice
stevhliu Oct 12, 2022
2459d16
add language modeling
stevhliu Oct 12, 2022
fafe707
add qa
stevhliu Oct 13, 2022
79efec0
make style
stevhliu Oct 13, 2022
a2a80fe
review and edits
stevhliu Oct 14, 2022
ae5a1b8
apply reviews
stevhliu Oct 28, 2022
aa27fb2
make style
stevhliu Oct 28, 2022
ccd298f
fix call to processor
stevhliu Nov 2, 2022
baaa937
apply audio reviews
stevhliu Nov 4, 2022
c76d834
update to better asr model
stevhliu Nov 18, 2022
c847946
make style
stevhliu Nov 18, 2022
234 changes: 187 additions & 47 deletions docs/source/en/tasks/asr.mdx

Large diffs are not rendered by default.

183 changes: 154 additions & 29 deletions docs/source/en/tasks/audio_classification.mdx
@@ -14,27 +14,44 @@ specific language governing permissions and limitations under the License.

<Youtube id="KWwzcmG98Ds"/>

Audio classification assigns a label or class to audio data. It is similar to text classification, except an audio input is continuous and must be discretized, whereas text can be split into tokens. Some practical applications of audio classification include identifying intent, speakers, and even animal species by their sounds.
Audio classification - just like text classification - assigns a class label to the input data. The only difference is that instead of text inputs, you have raw audio waveforms. Some practical applications of audio classification include identifying speaker intent, language classification, and even animal species by their sounds.

This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) to classify intent.
This guide will show you how to:

1. Finetune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to classify speaker intent.
2. Use your finetuned model for inference.

<Tip>

See the audio classification [task page](https://huggingface.co/tasks/audio-classification) for more information about its associated models, datasets, and metrics.

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Load MInDS-14 dataset

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) from the 🤗 Datasets library:
Start by loading the MInDS-14 dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Split this dataset into a train and test set:
Split the dataset's `train` split into a smaller train and test set with the [`~datasets.Dataset.train_test_split`] method. This'll give you a chance to experiment and make sure everything works before spending more time on the full dataset.

```py
>>> minds = minds.train_test_split(test_size=0.2)
@@ -56,7 +73,7 @@ DatasetDict({
})
```

While the dataset contains a lot of other useful information, like `lang_id` and `english_transcription`, you will focus on the `audio` and `intent_class` in this guide. Remove the other columns:
While the dataset contains a lot of useful information, like `lang_id` and `english_transcription`, you'll focus on the `audio` and `intent_class` in this guide. Remove the other columns with the [`~datasets.Dataset.remove_columns`] method:

```py
>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])
@@ -73,7 +90,12 @@ Take a look at an example now:
'intent_class': 2}
```

The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file. The `intent_class` column is an integer that represents the class id of intent. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
There are two fields:

- `audio`: a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file.
- `intent_class`: represents the class id of the speaker's intent.

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

```py
>>> labels = minds["train"].features["intent_class"].names
@@ -83,26 +105,24 @@ The `audio` column contains a 1-dimensional `array` of the speech signal that mu
... id2label[str(i)] = label
```
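
For reference, a minimal sketch of the full mapping could look like this (the loop body is an assumption, consistent with the fragment above):

```py
>>> labels = minds["train"].features["intent_class"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     # map in both directions so the model can recover a label name from a label id
...     label2id[label] = str(i)
...     id2label[str(i)] = label
```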

Now you can convert the label number to a label name for more information:
Now you can convert the label id to a label name:

```py
>>> id2label[str(2)]
'app_error'
```

Each keyword - or label - corresponds to a number; `2` indicates `app_error` in the example above.

## Preprocess

Load the Wav2Vec2 feature extractor to process the audio signal:
The next step is to load a Wav2Vec2 feature extractor to process the audio signal:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8000khz. You will need to resample the dataset to use the pretrained Wav2Vec2 model:
The MInDS-14 dataset has a sampling rate of 8kHz (you can find this information in its [dataset card](https://huggingface.co/datasets/PolyAI/minds14)), which means you'll need to resample the dataset to 16kHz to use the pretrained Wav2Vec2 model:

```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
@@ -114,11 +134,11 @@ The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sam
'intent_class': 2}
```

The preprocessing function needs to:
Now create a preprocessing function that:

1. Call the `audio` column to load and if necessary resample the audio file.
2. Check the sampling rate of the audio file matches the sampling rate of the audio data a model was pretrained with. You can find this information on the Wav2Vec2 [model card](https://huggingface.co/facebook/wav2vec2-base).
3. Set a maximum input length so longer inputs are batched without being truncated.
1. Calls the `audio` column to load, and if necessary, resample the audio file.
2. Checks if the sampling rate of the audio file matches the sampling rate of the audio data a model was pretrained with. You can find this information in the Wav2Vec2 [model card](https://huggingface.co/facebook/wav2vec2-base).
3. Sets a maximum input length to batch longer inputs without truncating them.

```py
>>> def preprocess_function(examples):
@@ -129,18 +149,46 @@ The preprocessing function needs to:
... return inputs
```
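
For reference, a minimal sketch of such a preprocessing function is shown below; the `max_length=16000` (roughly one second of audio at 16kHz) and `truncation=True` values are assumptions for illustration:

```py
>>> def preprocess_function(examples):
...     # load each (already resampled) audio example as a raw array
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     # extract features, capping every input at a fixed maximum length
...     inputs = feature_extractor(
...         audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
...     )
...     return inputs
```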

Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
To apply the preprocessing function over the entire dataset, use 🤗 Datasets [`~datasets.Dataset.map`] function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that's the name the model expects:

```py
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
>>> encoded_minds = encoded_minds.rename_column("intent_class", "label")
```

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
```

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:

```py
>>> import numpy as np


>>> def compute_metrics(eval_pred):
... predictions = np.argmax(eval_pred.predictions, axis=1)
... return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
```

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

<frameworkcontent>
<pt>
Load Wav2Vec2 with [`AutoModelForAudioClassification`]. Specify the number of labels, and pass the model the mapping between label number and label class:
<Tip>

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!

</Tip>
You're ready to start training your model now! Load Wav2Vec2 with [`AutoModelForAudioClassification`] along with the number of expected labels and the label mappings:

```py
>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer
@@ -151,25 +199,28 @@ Load Wav2Vec2 with [`AutoModelForAudioClassification`]. Specify the number of la
... )
```
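
For reference, a minimal sketch of this model-loading call, reusing the `id2label` and `label2id` mappings built earlier (the exact argument values are assumptions), could look like:

```py
>>> num_labels = len(id2label)
>>> model = AutoModelForAudioClassification.from_pretrained(
...     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
... )
```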

<Tip>

If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and feature extractor.
3. Call [`~Trainer.train`] to fine-tune your model.
1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, feature extractor, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.


```py
>>> training_args = TrainingArguments(
... output_dir="./results",
... output_dir="my_awesome_mind_model",
... evaluation_strategy="epoch",
... save_strategy="epoch",
... learning_rate=3e-5,
... num_train_epochs=5,
... per_device_train_batch_size=32,
... gradient_accumulation_steps=4,
... per_device_eval_batch_size=32,
... num_train_epochs=10,
... warmup_ratio=0.1,
... logging_steps=10,
... load_best_model_at_end=True,
... metric_for_best_model="accuracy",
... push_to_hub=True,
... )

>>> trainer = Trainer(
@@ -178,15 +229,89 @@ At this point, only three steps remain:
... train_dataset=encoded_minds["train"],
... eval_dataset=encoded_minds["test"],
... tokenizer=feature_extractor,
... compute_metrics=compute_metrics,
... )

>>> trainer.train()
```
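
For reference, a complete `Trainer` call along these lines would also pass the model and training arguments defined above (the `model` and `args` values here are assumptions based on the surrounding code):

```py
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=feature_extractor,
...     compute_metrics=compute_metrics,
... )
>>> trainer.train()
```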

Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
</pt>
</frameworkcontent>

<Tip>

For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).
For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Load an audio file you'd like to run inference on. Remember to resample the audio file's sampling rate to match the model's sampling rate if you need to!

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> audio_file = dataset[0]["audio"]["path"]
```

The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for audio classification with your model, and pass your audio file to it:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model")
>>> classifier(audio_file)
[
{'score': 0.09766869246959686, 'label': 'cash_deposit'},
{'score': 0.07998877018690109, 'label': 'app_error'},
{'score': 0.0781070664525032, 'label': 'joint_account'},
{'score': 0.07667109370231628, 'label': 'pay_bill'},
{'score': 0.0755252093076706, 'label': 'balance'}
]
```

You can also manually replicate the results of the `pipeline` if you'd like:

<frameworkcontent>
<pt>
Load a feature extractor to preprocess the audio file and return the input as PyTorch tensors:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("stevhliu/my_awesome_minds_model")
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
```

Pass your inputs to the model and return the logits:

```py
>>> import torch
>>> from transformers import AutoModelForAudioClassification

>>> model = AutoModelForAudioClassification.from_pretrained("stevhliu/my_awesome_minds_model")
>>> with torch.no_grad():
... logits = model(**inputs).logits
```

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> import torch

>>> predicted_class_ids = torch.argmax(logits).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
'cash_deposit'
```
</pt>
</frameworkcontent>